Comparative analysis of sequencing technologies for single-cell transcriptomics

Abstract

All single-cell RNA-seq protocols and technologies require library preparation prior to sequencing on a platform such as Illumina. Here, we present the first report to utilize the BGISEQ-500 platform for scRNA-seq, and compare the sensitivity and accuracy to Illumina sequencing. We generate a scRNA-seq resource of 468 unique single-cells and 1,297 matched single cDNA samples, performing SMARTer and Smart-seq2 protocols on mESCs and K562 cells with RNA spike-ins. We sequence these libraries on both BGISEQ-500 and Illumina HiSeq platforms using single- and paired-end reads. The two platforms have comparable sensitivity and accuracy in terms of quantification of gene expression, and low technical variability. Our study provides a standardised scRNA-seq resource to benchmark new scRNA-seq library preparation protocols and sequencing platforms.

Data proccessing

SCQUA is a python package that we developed for analyzing the single cell RNA Sequencing quality.

In [1]:
%pylab inline
from SCQUA import *
Populating the interactive namespace from numpy and matplotlib
/Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [2]:
ls ../
Figs/   data/   docs/   jup/    pdata/  pdata1/

Figs include the resulted figures
data include all the raw data
jup include the jupyter notebook to process the data and make the figures
pdata include all the processed data.

In [3]:
ls ../data/
bgi_263_722_ann.xlsx     bgi_ds_e6_tpm.csv.gz     sanger_ds_e4_qc.csv.gz
bgi_cts.csv.gz           bgi_qc.csv.gz            sanger_ds_e4_tpm.csv.gz
bgi_ds_e4_cts.csv.gz     bgi_se_cts.csv.gz        sanger_ds_e5_cts.csv.gz
bgi_ds_e4_qc.csv.gz      bgi_se_qc.csv.gz         sanger_ds_e5_qc.csv.gz
bgi_ds_e4_tpm.csv.gz     bgi_se_tpm.csv.gz        sanger_ds_e5_tpm.csv.gz
bgi_ds_e5_cts.csv.gz     bgi_tpm.csv.gz           sanger_ds_e6_cts.csv.gz
bgi_ds_e5_qc.csv.gz      phn.csv                  sanger_ds_e6_qc.csv.gz
bgi_ds_e5_tpm.csv.gz     sanger.csv               sanger_ds_e6_tpm.csv.gz
bgi_ds_e6_cts.csv.gz     sanger_cts.csv.gz        sanger_qc.csv.gz
bgi_ds_e6_qc.csv.gz      sanger_ds_e4_cts.csv.gz  sanger_tpm.csv.gz

Salmon outputs 3 information, the counts, tpm and qc metrics. We concatenate these information in *cts.gz, *tpm.gz and *qc.gz.

detection limit (sensitivity) and accuracy based on spike-ins can be calculated.

In [3]:
ercc = get_ERCC()
sirv = get_SIRV()
spike = pd.concat([ercc,sirv])

Get results for all the datasets.

In [39]:
for f in iglob('../data/*_tpm.csv.gz'):
    name = f.split('/')[-1].replace('_tpm.csv.gz',"")
    print(name)
    cts = pd.read_csv(f.replace('tpm','cts'), index_col=0)
    tpm = pd.read_csv(f, index_col=0)
    phn = pd.read_csv(f.replace('tpm','qc'), index_col=0)
    df = get_result(tpm, ercc, sirv, spike)
    phn = pd.concat([phn,df,cts.loc[cts.index.str.startswith("ENS")].T], axis=1)
    phn["Total_counts"] = cts.loc[cts.index.str.startswith("ENS")].sum()
    phn["study"] = name
    phn.to_csv("../pdata/%s.csv"%name)

This file is annotation of the all the cells sequenced.

In [7]:
phn = pd.read_csv("../data/phn.csv",index_col=0)
In [3]:
phn.head()
Out[3]:
Batch Cell Cell_line Fullname Place Platform Protocol Protocol2 Read_type Well repeat
ERR1901831 SMARTer 2 ESC SMARTer_2 UK HiSeq-2000 SMARTer HiSeq-2000 SMARTer PE UNK SMARTer
ERR1901830 SMARTer 1 ESC SMARTer_1 UK HiSeq-2000 SMARTer HiSeq-2000 SMARTer PE UNK SMARTer
ERR1901832 SMARTer 3 ESC SMARTer_3 UK HiSeq-2000 SMARTer HiSeq-2000 SMARTer PE UNK SMARTer
ERR1901833 SMARTer 4 ESC SMARTer_4 UK HiSeq-2000 SMARTer HiSeq-2000 SMARTer PE UNK SMARTer
ERR1901834 SMARTer 5 ESC SMARTer_5 UK HiSeq-2000 SMARTer HiSeq-2000 SMARTer PE UNK SMARTer
In [7]:
phn.tail()
Out[7]:
Batch Cell Cell_line Fullname Place Platform Protocol Protocol2 Read_type Well repeat
Index_CWHPEI17060082 ESC_AR_P101_H 8 ESC ESC_A111_P101_H_08 China Hiseq4000 Smart-Seq2 Hiseq4000 ESC Library rep PE101 UNK ESC Library rep
Index_CWHPEI17060083 ESC_AR_P101_H 9 ESC ESC_A111_P101_H_09 China Hiseq4000 Smart-Seq2 Hiseq4000 ESC Library rep PE101 UNK ESC Library rep
Index_CWHPEI17060084 ESC_AR_P101_H 10 ESC ESC_A111_P101_H_10 China Hiseq4000 Smart-Seq2 Hiseq4000 ESC Library rep PE101 UNK ESC Library rep
Index_CWHPEI17060085 ESC_AR_P101_H 11 ESC ESC_A111_P101_H_11 China Hiseq4000 Smart-Seq2 Hiseq4000 ESC Library rep PE101 UNK ESC Library rep
Index_CWHPEI17060086 ESC_AR_P101_H 12 ESC ESC_A111_P101_H_12 China Hiseq4000 Smart-Seq2 Hiseq4000 ESC Library rep PE101 UNK ESC Library rep

map the annotation

In [18]:
for f in iglob("../pdata/*.csv"):
    print(f)
    fout = f.replace("pdata","pdata1")
    df = pd.read_csv(f,index_col=0)
    df = df.loc[df.index.isin(phn.index)]
    dfp = phn.loc[df.index]
    df = pd.concat([df, dfp], axis=1)
    df.to_csv(fout)
../pdata/sanger_ds_e6.csv
../pdata/bgi_ds_e5.csv
../pdata/bgi_ds_e4.csv
../pdata/sanger_ds_e5.csv
../pdata/bgi_ds_e6.csv
../pdata/sanger_ds_e4.csv
../pdata/sanger.csv
../pdata/bgi.csv
../pdata/bgi_se.csv
In [ ]:
 

Fig 1B 1C

(B) Single-cell detection limit (Sensitivity) of mESC cells, downsampled across two orders of magnitude from SMARTer and two Smart-seq2 replicates (633 samples). The single-cell sensitivities are largely similar between different library preparation across scRNA-seq protocols. (C) Single-cell accuracy of mESC cells, downsampled across two orders of magnitude for SMARTer and two Smart-seq2 replicates (633 samples). The grey dotted lines indicate downsampling at different read depths per cell, while red line indicates saturation per cell.

In [79]:
dfs = []
for f in iglob("../pdata1/*_ds_*"):
    df = pd.read_csv(f, index_col=0)
    df["id"] = df.index
    df.index = df.study+"_"+df.id
    df = df[["id",'Cell','Fullname','num_processed',\
             'detection_limit','accuracy','Protocol','Batch','Protocol2']+
            df.columns[df.columns.str.startswith('ENS')].tolist()]
    dfs.append(df)
df = pd.concat(dfs,axis =0)
In [80]:
df.shape
Out[80]:
(1896, 34848)
In [81]:
df.head()
Out[81]:
id Cell Fullname num_processed detection_limit accuracy Protocol Batch Protocol2 ENSMUSG00000107196.1 ... ENSMUSG00000083997.1 ENSMUSG00000111337.1 ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1
sanger_ds_e6_ERR1901834 ERR1901834 5 SMARTer_5 1000000 76.121794 0.500011 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000000 8.27496 1.0 26.4398 0.0
sanger_ds_e6_ERR1901830 ERR1901830 1 SMARTer_1 1000000 111.926773 0.613356 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000982 5.00000 0.0 29.1214 0.0
sanger_ds_e6_ERR1901832 ERR1901832 3 SMARTer_3 1000000 71.153016 0.632203 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 1.01072 0.0 0.0 0.0 0.000000 6.22531 0.0 35.2669 0.0
sanger_ds_e6_ERR1901835 ERR1901835 6 SMARTer_6 1000000 71.030314 0.682535 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000000 0.00000 0.0 30.2924 0.0
sanger_ds_e6_ERR1901833 ERR1901833 4 SMARTer_4 1000000 62.824744 0.628298 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000000 101.10100 0.0 46.1377 0.0

5 rows × 34848 columns

In [82]:
df=df[df.Protocol.str.startswith("S")]

use only the protocols in sanger

In [83]:
df = df.loc[~(df.Protocol2.str.startswith('BGISEQ-500 K562')|\
              df.Protocol2.str.startswith('Hiseq4000')|\
              df.Protocol2.str.startswith('BGISEQ-500 ESC'))]
In [84]:
df.Protocol.value_counts()
Out[84]:
Smart-Seq2    972
SMARTer       576
Name: Protocol, dtype: int64
In [85]:
df.Protocol2.value_counts()
Out[85]:
BGISEQ-500 SMARTer            288
HiSeq-2000 SMARTer            288
HiSeq-2000 Smart-Seq2 rep1    288
BGISEQ-500 Smart-Seq2 rep1    285
BGISEQ-500 Smart-Seq2 rep2    216
HiSeq-2000 Smart-Seq2 rep2    183
Name: Protocol2, dtype: int64
In [86]:
df=df.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')
In [87]:
df.to_csv("../pdata1/Fig1BC.csv")
In [2]:
df = pd.read_csv("../pdata1/Fig1BC.csv", index_col=0)
In [88]:
figsize(5,5)
matplotlib.rcParams.update({'font.size': 14})
names = ['BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1', 
         'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2', 
         'BGISEQ-500 SMARTer','HiSeq-2000 SMARTer']
ax = fit_sensitivity(df,key2="num_processed",key3='Protocol2', \
                     yscale='log', xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Detection limit \n (# of molecules)', \
                     ylim = [0, 1e4], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Sensitivity', titlesize = 20, names = names)
np.log10(detection_limit) ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                                OLS Regression Results                               
=====================================================================================
Dep. Variable:     np.log10(detection_limit)   R-squared:                       0.713
Model:                                   OLS   Adj. R-squared:                  0.712
Method:                        Least Squares   F-statistic:                     547.6
Date:                       Sat, 01 Dec 2018   Prob (F-statistic):               0.00
Time:                               15:51:18   Log-Likelihood:                -209.80
No. Observations:                       1548   AIC:                             435.6
Df Residuals:                           1540   BIC:                             478.4
Df Model:                                  7                                         
Covariance Type:                   nonrobust                                         
==============================================================================================================
                                                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
Intercept                                      9.0510      0.345     26.211      0.000       8.374       9.728
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]    -0.1199      0.023     -5.167      0.000      -0.165      -0.074
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]    -0.0703      0.025     -2.813      0.005      -0.119      -0.021
C(Protocol2)[T.HiSeq-2000 SMARTer]             0.0790      0.023      3.413      0.001       0.034       0.124
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]     0.0037      0.023      0.161      0.872      -0.042       0.049
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]    -0.0698      0.026     -2.658      0.008      -0.121      -0.018
np.power(np.log10(num_processed), 2)           0.1743      0.014     12.300      0.000       0.146       0.202
np.log10(num_processed)                       -2.2506      0.141    -15.912      0.000      -2.528      -1.973
==============================================================================
Omnibus:                     1287.987   Durbin-Watson:                   1.751
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            93228.151
Skew:                           3.391   Prob(JB):                         0.00
Kurtosis:                      40.409   Cond. No.                     1.44e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
np.log10(num_processed)                      -2.250609
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]   -0.119945
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]   -0.070343
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]   -0.069809
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]    0.003725
C(Protocol2)[T.HiSeq-2000 SMARTer]            0.079032
np.power(np.log10(num_processed), 2)          0.174267
Intercept                                     9.050951
dtype: float64
read_satureation: 2866584
In [89]:
figsize(5,5)
matplotlib.rcParams.update({'font.size': 14})
names = ['BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1', 
         'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2', 
         'BGISEQ-500 SMARTer','HiSeq-2000 SMARTer']
ax = fit_sensitivity(df,key2="num_processed",key3='Protocol2', \
                     yscale='log', xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Detection limit \n (# of molecules)', \
                     ylim = [0, 1e4], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Sensitivity', titlesize = 20, names = names, colordots = True)
np.log10(detection_limit) ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                                OLS Regression Results                               
=====================================================================================
Dep. Variable:     np.log10(detection_limit)   R-squared:                       0.713
Model:                                   OLS   Adj. R-squared:                  0.712
Method:                        Least Squares   F-statistic:                     547.6
Date:                       Sat, 01 Dec 2018   Prob (F-statistic):               0.00
Time:                               15:51:18   Log-Likelihood:                -209.80
No. Observations:                       1548   AIC:                             435.6
Df Residuals:                           1540   BIC:                             478.4
Df Model:                                  7                                         
Covariance Type:                   nonrobust                                         
==============================================================================================================
                                                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
Intercept                                      9.0510      0.345     26.211      0.000       8.374       9.728
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]    -0.1199      0.023     -5.167      0.000      -0.165      -0.074
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]    -0.0703      0.025     -2.813      0.005      -0.119      -0.021
C(Protocol2)[T.HiSeq-2000 SMARTer]             0.0790      0.023      3.413      0.001       0.034       0.124
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]     0.0037      0.023      0.161      0.872      -0.042       0.049
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]    -0.0698      0.026     -2.658      0.008      -0.121      -0.018
np.power(np.log10(num_processed), 2)           0.1743      0.014     12.300      0.000       0.146       0.202
np.log10(num_processed)                       -2.2506      0.141    -15.912      0.000      -2.528      -1.973
==============================================================================
Omnibus:                     1287.987   Durbin-Watson:                   1.751
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            93228.151
Skew:                           3.391   Prob(JB):                         0.00
Kurtosis:                      40.409   Cond. No.                     1.44e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
np.log10(num_processed)                      -2.250609
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]   -0.119945
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]   -0.070343
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]   -0.069809
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]    0.003725
C(Protocol2)[T.HiSeq-2000 SMARTer]            0.079032
np.power(np.log10(num_processed), 2)          0.174267
Intercept                                     9.050951
dtype: float64
read_satureation: 2866584
In [90]:
ax = fit_sensitivity(df, fun = 'accuracy',key1="accuracy",key2="num_processed",key3='Protocol2', \
                     xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
                     ylim = [0, 1], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Accuracy', titlesize = 20, names = names)
accuracy ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               accuracy   R-squared:                       0.350
Model:                            OLS   Adj. R-squared:                  0.347
Method:                 Least Squares   F-statistic:                     118.4
Date:                Sat, 01 Dec 2018   Prob (F-statistic):          3.30e-139
Time:                        15:51:19   Log-Likelihood:                 1019.8
No. Observations:                1548   AIC:                            -2024.
Df Residuals:                    1540   BIC:                            -1981.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================================================
                                                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
Intercept                                     -0.9296      0.156     -5.958      0.000      -1.236      -0.624
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]     0.0515      0.010      4.914      0.000       0.031       0.072
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]     0.0075      0.011      0.666      0.505      -0.015       0.030
C(Protocol2)[T.HiSeq-2000 SMARTer]             0.0119      0.010      1.140      0.255      -0.009       0.032
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]     0.0336      0.010      3.209      0.001       0.013       0.054
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]     0.0287      0.012      2.421      0.016       0.005       0.052
np.power(np.log10(num_processed), 2)          -0.0378      0.006     -5.903      0.000      -0.050      -0.025
np.log10(num_processed)                        0.4839      0.064      7.571      0.000       0.359       0.609
==============================================================================
Omnibus:                       56.216   Durbin-Watson:                   2.061
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.865
Skew:                          -0.490   Prob(JB):                     3.68e-14
Kurtosis:                       2.992   Cond. No.                     1.44e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept                                    -0.929621
np.power(np.log10(num_processed), 2)         -0.037794
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]    0.007529
C(Protocol2)[T.HiSeq-2000 SMARTer]            0.011925
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]    0.028737
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]    0.033581
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]    0.051542
np.log10(num_processed)                       0.483914
dtype: float64
read_satureation: 2523440
In [91]:
ax = fit_sensitivity(df, fun = 'accuracy',key1="accuracy",key2="num_processed",key3='Protocol2', \
                     xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
                     ylim = [0, 1], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Accuracy', titlesize = 20, names = names, colordots=True)
accuracy ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               accuracy   R-squared:                       0.350
Model:                            OLS   Adj. R-squared:                  0.347
Method:                 Least Squares   F-statistic:                     118.4
Date:                Sat, 01 Dec 2018   Prob (F-statistic):          3.30e-139
Time:                        15:51:20   Log-Likelihood:                 1019.8
No. Observations:                1548   AIC:                            -2024.
Df Residuals:                    1540   BIC:                            -1981.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================================================
                                                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
Intercept                                     -0.9296      0.156     -5.958      0.000      -1.236      -0.624
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]     0.0515      0.010      4.914      0.000       0.031       0.072
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]     0.0075      0.011      0.666      0.505      -0.015       0.030
C(Protocol2)[T.HiSeq-2000 SMARTer]             0.0119      0.010      1.140      0.255      -0.009       0.032
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]     0.0336      0.010      3.209      0.001       0.013       0.054
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]     0.0287      0.012      2.421      0.016       0.005       0.052
np.power(np.log10(num_processed), 2)          -0.0378      0.006     -5.903      0.000      -0.050      -0.025
np.log10(num_processed)                        0.4839      0.064      7.571      0.000       0.359       0.609
==============================================================================
Omnibus:                       56.216   Durbin-Watson:                   2.061
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.865
Skew:                          -0.490   Prob(JB):                     3.68e-14
Kurtosis:                       2.992   Cond. No.                     1.44e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept                                    -0.929621
np.power(np.log10(num_processed), 2)         -0.037794
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep2]    0.007529
C(Protocol2)[T.HiSeq-2000 SMARTer]            0.011925
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep2]    0.028737
C(Protocol2)[T.HiSeq-2000 Smart-Seq2 rep1]    0.033581
C(Protocol2)[T.BGISEQ-500 Smart-Seq2 rep1]    0.051542
np.log10(num_processed)                       0.483914
dtype: float64
read_satureation: 2523440

Fig 1D 1E

(D) PCA for matched single-cell cDNA samples performed using SMARTer and two replicates of Smart-seq2 and sequenced across both sequencing platforms. Red and green colored circles indicate sequencing of matched cDNA on Illumina HiSeq2500 and BGISEQ-500 respectively. The dotted lines represent distance i.e. measure of similarity across sequencing platforms. (E) Single-cell correlations for each scRNA-seq protocol and across sequencing platforms. The correlations (R=0.52~0.70) are comparable between sequencing platforms.

In [2]:
df = pd.read_csv("../pdata1/Fig1BC.csv", index_col=0)
In [57]:
figsize(5,5)
dfs1 = []
dfs2 = []
for p, l in zip(['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer'],[230, 120, 210]):
    df1, df2 = plot_Fig1E(df, protocol = p,\
        lim = l,\
        xlabel = "BGISEQ-500", ylabel="HiSeq-2000",\
        xlabelsize = 18, ylabelsize=18, titlesize = 18, legendsize = 16)
    plt.show()
    dfs1.append(df1)
    dfs2.append(df2)

[PR12&PR8]

In [5]:
def plot_Fig1E(df, \
    protocol = "SMARTer",\
    key1 = 'Batch',\
    key2 = 'Protocol2',\
    key3 = 'Cell',\
    key4 = 'Fullname',\
    lim = 210,\
    cutoff = 1000, \
    xlabel = None, ylabel=None,\
    xlabelsize = None, ylabelsize=None, titlesize = None, legendsize = None):

    dfx = df[df[key1] == protocol]
    dfx = dfx[dfx.index.str.find('e6')>0]
    df1 = dfx[dfx[key2].str.startswith('BGISEQ-500')].sort_values(by=key3)
    df2 = dfx[dfx[key2].str.startswith('HiSeq-2000')].sort_values(by=key3)
    df1 = df1[df1[key4].isin(df2[key4])]
    df2 = df2[df2[key4].isin(df1[key4])]
    
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import scale	

    dfa = pd.concat([df1,df2], axis = 0)
    exprs = np.log1p(dfa.T.loc[dfa.columns.str.startswith("ENS")].astype(np.float32))
    pca = PCA(n_components=50)
    pca_res = pca.fit_transform(scale(exprs.T, 1))

    dfa["PC1"] = pca_res[:,0]
    dfa["PC2"] = pca_res[:,1]
    
    from scipy.spatial.distance import cdist
    
    dfb = dfa[['PC1','PC2']]
    dfb1 = dfb.iloc[:int((dfb.shape[0])/2)]
    dfb2 = dfb.iloc[int((dfb.shape[0])/2):]
    cdst = cdist(dfb1.values,dfb2.values)
#     print(np.diagonal(cdst))

    xx1 = df1.detection_limit[(df2.detection_limit <cutoff).tolist()]
    xx2 = df2.detection_limit[(df2.detection_limit <cutoff).tolist()]
    xx3 = np.diagonal(cdst)[(df2.detection_limit <cutoff).tolist()]

    slope, intercept, r_value, p_value, std_err = \
        scipy.stats.linregress(xx1, xx2)
    from scipy.stats import spearmanr
    r_value, _ = spearmanr(xx1,xx2)
    plt.plot(xx1, xx2, 'k.', label =  "R=%.2f"%r_value)
    plt.plot(xx1[xx3>50], xx2[xx3>50], 'r.', label='')
    print("outlier cells:")
    print(df2.index[np.diagonal(cdst)>50])

    plt.xlim(0,lim)
    plt.ylim(0,lim)

    if xlabel is None: xlabel = key2
    if ylabel is None: ylabel = key1
    if xlabelsize is None:
        plt.xlabel(xlabel)
    else:
        plt.xlabel(xlabel, fontsize=xlabelsize)
    if ylabelsize is None:
        plt.ylabel(ylabel)
    else:
        plt.ylabel(ylabel, fontsize=ylabelsize)
    if not titlesize is None:
        plt.title(protocol, fontsize = titlesize)
    else:
        plt.title(protocol)
    if legendsize is None:
        plt.legend(loc='upper right')
    else:
        plt.legend(loc='upper right', fontsize=legendsize)
    return(df1, df2)

outlier cells are listed here:

In [6]:
figsize(5,5)
dfs1 = []
dfs2 = []
for p, l in zip(['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer'],[230, 120, 210]):
    df1, df2 = plot_Fig1E(df, protocol = p,\
        lim = l,\
        xlabel = "BGISEQ-500", ylabel="HiSeq-2000",\
        xlabelsize = 18, ylabelsize=18, titlesize = 18, legendsize = 16)
    plt.show()
    dfs1.append(df1)
    dfs2.append(df2)
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:25: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
outlier cells:
Index(['sanger_ds_e6_ERR1903903', 'sanger_ds_e6_ERR1903942',
       'sanger_ds_e6_ERR1903954', 'sanger_ds_e6_ERR1903964',
       'sanger_ds_e6_ERR1903967'],
      dtype='object')
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:25: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
outlier cells:
Index([], dtype='object')
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:25: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
outlier cells:
Index(['sanger_ds_e6_ERR1901837', 'sanger_ds_e6_ERR1901853',
       'sanger_ds_e6_ERR1901914'],
      dtype='object')

[PR8 FigS2C]

In [8]:
dfs = []
for f in ['../pdata1/bgi.csv','../pdata1/sanger.csv']:
    df = pd.read_csv(f, index_col=0)
    df["id"] = df.index
    df.index = df.study+"_"+df.id
    df = df[["id",'Cell','Fullname','num_processed',\
             'detection_limit','accuracy','Protocol','Batch','Protocol2']+
            df.columns[df.columns.str.startswith('ENS')].tolist()]
    dfs.append(df)
df = pd.concat(dfs,axis =0)
In [9]:
df=df[df.Protocol.str.startswith("S")]
In [10]:
df = df.loc[~(df.Protocol2.str.startswith('BGISEQ-500 K562')|\
              df.Protocol2.str.startswith('Hiseq4000')|\
              df.Protocol2.str.startswith('BGISEQ-500 ESC'))]
In [11]:
df.Protocol.value_counts()
Out[11]:
Smart-Seq2    324
SMARTer       192
Name: Protocol, dtype: int64
In [12]:
df.Protocol2.value_counts()
Out[12]:
HiSeq-2000 SMARTer            96
BGISEQ-500 SMARTer            96
HiSeq-2000 Smart-Seq2 rep1    96
BGISEQ-500 Smart-Seq2 rep1    95
BGISEQ-500 Smart-Seq2 rep2    72
HiSeq-2000 Smart-Seq2 rep2    61
Name: Protocol2, dtype: int64
In [13]:
df=df.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')
In [14]:
df.to_csv("../pdata1/FigS2C.csv")
In [141]:
df.shape
Out[141]:
(516, 34848)
In [ ]:
 
In [17]:
def plot_Fig1E(df, \
	protocol = "SMARTer",\
	key1 = 'Batch',\
	key2 = 'Protocol2',\
	key3 = 'Cell',\
	key4 = 'Fullname',\
	lim = 210,\
	cutoff = 1000, \
	xlabel = None, ylabel=None,\
	xlabelsize = None, ylabelsize=None, titlesize = None, legendsize = None):

	dfx = df[df[key1] == protocol]
# 	dfx = dfx[dfx.index.str.find('e6')>0]
	df1 = dfx[dfx[key2].str.startswith('BGISEQ-500')].sort_values(by=key3)
	df2 = dfx[dfx[key2].str.startswith('HiSeq-2000')].sort_values(by=key3)
	df1 = df1[df1[key4].isin(df2[key4])]
	df2 = df2[df2[key4].isin(df1[key4])]

	xx1 = df1.detection_limit[(df2.detection_limit <cutoff).tolist()]
	xx2 = df2.detection_limit[(df2.detection_limit <cutoff).tolist()]

	slope, intercept, r_value, p_value, std_err = \
		scipy.stats.linregress(xx1, xx2)
	from scipy.stats import spearmanr
	r_value, _ = spearmanr(xx1,xx2)
	plt.plot(xx1, xx2, 'k.', label =  "R=%.2f"%r_value)

	plt.xlim(0,lim)
	plt.ylim(0,lim)
	
	if xlabel is None: xlabel = key2
	if ylabel is None: ylabel = key1
	if xlabelsize is None:
		plt.xlabel(xlabel)
	else:
		plt.xlabel(xlabel, fontsize=xlabelsize)
	if ylabelsize is None:
		plt.ylabel(ylabel)
	else:
		plt.ylabel(ylabel, fontsize=ylabelsize)
	if not titlesize is None:
		plt.title(protocol, fontsize = titlesize)
	else:
		plt.title(protocol)
	if legendsize is None:
		plt.legend(loc='upper right')
	else:
		plt.legend(loc='upper right', fontsize=legendsize)
	return(df1, df2)
In [18]:
figsize(5,5)
dfs1 = []
dfs2 = []
for p, l in zip(['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer'],[230, 120, 210]):
    df1, df2 = plot_Fig1E(df, protocol = p,\
        lim = l,\
        xlabel = "BGISEQ-500", ylabel="HiSeq-2000",\
        xlabelsize = 18, ylabelsize=18, titlesize = 18, legendsize = 16)
    plt.show()
    dfs1.append(df1)
    dfs2.append(df2)
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [38]:
def plot_Fig1D(df1, df2, \
               protocol = "SMARTer", \
                key2 = 'Protocol2', \
                key3 = 'BGISEQ-500', \
                key4 = 'Fullname',\
                xlabelsize = None, ylabelsize = None, titlesize = None):
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import scale	

    df = pd.concat([df1,df2], axis = 0)
    exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
    pca = PCA(n_components=50)
    pca_res = pca.fit_transform(scale(exprs.T, 1))

    df["PC1"] = pca_res[:,0]
    df["PC2"] = pca_res[:,1]

    g = df.groupby('Protocol2')
    colors = ["#00cc99","#e0115f"]
    for i, group in enumerate(g.indices):
        tmp = g.indices[group]
        plt.scatter(df.PC1.iloc[tmp], df.PC2.iloc[tmp], label=group.replace(protocol,""), lw = 0, c=colors[i])
    from scipy.spatial.distance import cdist
    for i in df[df[key2].str.startswith(key3)][key4]:
        j = df[(df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
        k = df[(~df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
#         d1 = abs(j["PC1"].values - k["PC1"].values)
#         d2 = abs(j["PC2"].values - k["PC2"].values)
#         d = np.sqrt(d1*d1+d2*d2)
        cdst = cdist(j[['PC1','PC2']].values,k[['PC1','PC2']].values)
        d = cdst[0][0]
        if d < 60:
            plt.plot([j["PC1"],k["PC1"]],[j["PC2"],k["PC2"]],'k-.')
        else:
            plt.plot([j["PC1"],k["PC1"]],[j["PC2"],k["PC2"]],'r-.')

    plt.xlabel("PC1 (%.3f)"%pca.explained_variance_ratio_[0], fontsize = xlabelsize)
    plt.ylabel("PC2 (%.3f)"%pca.explained_variance_ratio_[1], fontsize = ylabelsize)
    if not titlesize is None:
        plt.title(protocol, fontsize = titlesize)
    else:
        plt.title(protocol)
    plt.legend(scatterpoints=3, bbox_to_anchor=(1.03, 1), borderaxespad=0.);
In [39]:
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
    plt.clf()
    plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18)
    plt.show()
    i+=1
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [90]:
def plot_Fig1D(df1, df2, \
               protocol = "SMARTer", \
                key2 = 'Protocol2', \
                key3 = 'BGISEQ-500', \
                key4 = 'Fullname',\
                xlabelsize = None, ylabelsize = None, titlesize = None, tp = 1):
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import scale	

    df = pd.concat([df1,df2], axis = 0)
    exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
    pca = PCA(n_components=50)
    pca_res = pca.fit_transform(scale(exprs.T, 1))

    df["PC1"] = pca_res[:,0]
    df["PC2"] = pca_res[:,1]

    if tp ==1:
        g = df.groupby('Protocol2')
        colors = ["#00cc99","#e0115f"]
        for i, group in enumerate(g.indices):
            tmp = g.indices[group]
            plt.scatter(df.PC1.iloc[tmp], df.PC2.iloc[tmp], label=group.replace(protocol,""), lw = 0, c=colors[i])
        for i in df[df[key2].str.startswith(key3)][key4]:
            j = df[(df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
            k = df[(~df[key2].str.startswith(key3)) & (df[key4] == i)][['PC1','PC2']]
            plt.plot([j["PC1"],k["PC1"]],[j["PC2"],k["PC2"]],'k-.')
        plt.legend(scatterpoints=3, bbox_to_anchor=(1.03, 1), borderaxespad=0.);
    else:
        if tp ==2:
            df1 = df[df.columns[df.columns.str.startswith('ENS')]]
            df['n_genes'] = np.sum(df1>0,axis=1)
            plt.scatter(df.PC1, df.PC2, lw = 0, c=df['n_genes'])
            plt.colorbar()
        elif tp ==3:
            df['Nanog'] = df['ENSMUSG00000012396.12']
            plt.scatter(df.PC1, df.PC2, lw = 0, c=df['Nanog'])
            plt.colorbar()
        elif tp ==4:
            df['Pou5f1'] = df['ENSMUSG00000024406.16']
            plt.scatter(df.PC1, df.PC2, lw = 0, c=df['Pou5f1'])
            plt.colorbar()
        elif tp ==5:
            df['Pax6'] = df['ENSMUSG00000027168.21']
            plt.scatter(df.PC1, df.PC2, lw = 0, c=df['Pax6'])
            plt.colorbar()
            
    plt.xlabel("PC1 (%.3f)"%pca.explained_variance_ratio_[0], fontsize = xlabelsize)
    plt.ylabel("PC2 (%.3f)"%pca.explained_variance_ratio_[1], fontsize = ylabelsize)
    if not titlesize is None:
        plt.title(protocol, fontsize = titlesize)
    else:
        plt.title(protocol)
    

[PR6]

In [72]:
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
    plt.clf()
    plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=2)
    plt.show()
    i+=1
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
In [73]:
df.shape
Out[73]:
(1548, 34848)
In [75]:
ls ../data/GRCm38.cdna.all.symbol.tsv
../data/GRCm38.cdna.all.symbol.tsv
In [84]:
df1 = pd.read_csv('../data/GRCm38.cdna.all.symbol.tsv', sep=' ',header=None, index_col=1)
In [86]:
df1.loc['Nanog']
Out[86]:
0    ENSMUSG00000012396.12
Name: Nanog, dtype: object
In [88]:
df1.loc['Pou5f1']
Out[88]:
0    ENSMUSG00000024406.16
Name: Pou5f1, dtype: object
In [89]:
df1.loc['Pax6']
Out[89]:
0    ENSMUSG00000027168.21
Name: Pax6, dtype: object
In [ ]:
 

[cells in the red circle]

In [70]:
df = pd.concat([dfs1[0],dfs2[0]])
In [72]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))

df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:5: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
In [74]:
df[['PC1','PC2']].to_csv("PCA_cell_SM2rep1.csv")
In [75]:
df = pd.concat([dfs1[1],dfs2[1]])
In [76]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))

df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:5: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
In [77]:
df[['PC1','PC2']].to_csv("PCA_cell_SM2rep2.csv")
In [ ]:
 
In [78]:
df = pd.concat([dfs1[2],dfs2[2]])
In [79]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
exprs = np.log1p(df.T.loc[df.columns.str.startswith("ENS")].astype(np.float32))
pca = PCA(n_components=50)
pca_res = pca.fit_transform(scale(exprs.T, 1))

df["PC1"] = pca_res[:,0]
df["PC2"] = pca_res[:,1]
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:5: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
In [80]:
df[['PC1','PC2']].to_csv("PCA_cell_SMARTer.csv")
In [ ]:
 
In [ ]:
 
In [ ]:
 

[PR5]

Nanog

In [91]:
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
    plt.clf()
    plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=3)
    plt.show()
    i+=1
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.

Pou5f1

In [92]:
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
    plt.clf()
    plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=4)
    plt.show()
    i+=1
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.

Pax6

In [93]:
i=0
protocols = ['Smart-Seq2 rep1','Smart-Seq2 rep2','SMARTer']
for df1, df2 in zip(dfs1, dfs2):
    plt.clf()
    plot_Fig1D(df1, df2, protocol = protocols[i], xlabelsize = 18, ylabelsize=18, titlesize = 18,tp=5)
    plt.show()
    i+=1
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: DataConversionWarning:Data with input dtype float32 were all converted to float64 by the scale function.
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Fig 2B

In [31]:
dfs = []
for f in iglob("../pdata1/*_ds_*"):
    df = pd.read_csv(f, index_col=0)
    df["id"] = df.index
    df.index = df.study+"_"+df.id
    df = df[["id",'Cell','Fullname','num_processed',\
             'detection_limit_ERCC','accuracy_ERCC','Protocol','Batch','Protocol2']+
            df.columns[df.columns.str.startswith('ENS')].tolist()]
    dfs.append(df)
df = pd.concat(dfs,axis =0)
In [32]:
df.shape
Out[32]:
(1896, 34848)
In [33]:
df.head()
Out[33]:
id Cell Fullname num_processed detection_limit_ERCC accuracy_ERCC Protocol Batch Protocol2 ENSMUSG00000107196.1 ... ENSMUSG00000083997.1 ENSMUSG00000111337.1 ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1
sanger_ds_e6_ERR1901834 ERR1901834 5 SMARTer_5 1000000 216.716935 0.724199 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000000 8.27496 1.0 26.4398 0.0
sanger_ds_e6_ERR1901830 ERR1901830 1 SMARTer_1 1000000 388.738352 0.742601 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000982 5.00000 0.0 29.1214 0.0
sanger_ds_e6_ERR1901832 ERR1901832 3 SMARTer_3 1000000 135.419150 0.813372 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 1.01072 0.0 0.0 0.0 0.000000 6.22531 0.0 35.2669 0.0
sanger_ds_e6_ERR1901835 ERR1901835 6 SMARTer_6 1000000 203.113225 0.926988 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000000 0.00000 0.0 30.2924 0.0
sanger_ds_e6_ERR1901833 ERR1901833 4 SMARTer_4 1000000 212.025597 0.786830 SMARTer SMARTer HiSeq-2000 SMARTer 0.0 ... 0.0 0.00000 0.0 0.0 0.0 0.000000 101.10100 0.0 46.1377 0.0

5 rows × 34848 columns

In [34]:
df.Protocol.value_counts()
Out[34]:
Smart-Seq2    1320
SMARTer        576
Name: Protocol, dtype: int64
In [35]:
df.Protocol2.value_counts()
Out[35]:
BGISEQ-500 SMARTer                 288
HiSeq-2000 Smart-Seq2 rep1         288
HiSeq-2000 SMARTer                 288
BGISEQ-500 Smart-Seq2 rep1         285
BGISEQ-500 Smart-Seq2 rep2         216
HiSeq-2000 Smart-Seq2 rep2         183
Hiseq4000 K562 Sequencing rep1     111
BGISEQ-500 K562 Seuqencing rep1    111
Hiseq4000 ESC Sequencing rep1       63
BGISEQ-500 ESC Sequencing rep1      63
Name: Protocol2, dtype: int64
In [36]:
df['Protocol2'] = df['Protocol2'].str.replace("Seuqencing",'Sequencing')
In [37]:
df = df[df.Protocol2.str.find("Sequencing")>0]
In [38]:
df.Protocol2.value_counts()
Out[38]:
BGISEQ-500 K562 Sequencing rep1    111
Hiseq4000 K562 Sequencing rep1     111
Hiseq4000 ESC Sequencing rep1       63
BGISEQ-500 ESC Sequencing rep1      63
Name: Protocol2, dtype: int64
In [39]:
df.to_csv("../pdata1/Fig2B.csv")
In [2]:
df = pd.read_csv("../pdata1/Fig2B.csv", index_col=0)
In [3]:
df=df.replace([np.inf, -np.inf], np.nan).dropna(axis=0, how='any')
In [5]:
figsize(5,5)
matplotlib.rcParams.update({'font.size': 14})
names = ['BGISEQ-500 ESC Sequencing rep1', 
         'Hiseq4000 ESC Sequencing rep1', 
         'BGISEQ-500 K562 Sequencing rep1','Hiseq4000 K562 Sequencing rep1']
colors = [matplotlib.colors.rgb2hex(plt.get_cmap("Paired")(i)) for i in range(6,10)]
ax = fit_sensitivity(df,fun='np.log10(detection_limit_ERCC)', key1 ='detection_limit_ERCC',key2="num_processed",key3='Protocol2', \
                     yscale='log', xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Detection limit \n (# of molecules)', \
                     ylim = [0, 1e5], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Sensitivity', titlesize = 20, names = names, colors = colors)
np.log10(detection_limit_ERCC) ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                                  OLS Regression Results                                  
==========================================================================================
Dep. Variable:     np.log10(detection_limit_ERCC)   R-squared:                       0.936
Model:                                        OLS   Adj. R-squared:                  0.935
Method:                             Least Squares   F-statistic:                     1000.
Date:                            Sat, 01 Dec 2018   Prob (F-statistic):          1.08e-201
Time:                                    17:31:41   Log-Likelihood:                 55.039
No. Observations:                             348   AIC:                            -98.08
Df Residuals:                                 342   BIC:                            -74.97
Df Model:                                       5                                         
Covariance Type:                        nonrobust                                         
===================================================================================================================
                                                      coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept                                          11.8042      0.581     20.309      0.000      10.661      12.947
C(Protocol2)[T.BGISEQ-500 K562 Sequencing rep1]    -1.0821      0.033    -32.922      0.000      -1.147      -1.017
C(Protocol2)[T.Hiseq4000 ESC Sequencing rep1]       0.0596      0.037      1.606      0.109      -0.013       0.133
C(Protocol2)[T.Hiseq4000 K562 Sequencing rep1]     -1.0230      0.033    -31.123      0.000      -1.088      -0.958
np.power(np.log10(num_processed), 2)                0.2432      0.024     10.264      0.000       0.197       0.290
np.log10(num_processed)                            -3.1457      0.237    -13.254      0.000      -3.613      -2.679
==============================================================================
Omnibus:                        7.967   Durbin-Watson:                   1.806
Prob(Omnibus):                  0.019   Jarque-Bera (JB):               12.680
Skew:                           0.085   Prob(JB):                      0.00176
Kurtosis:                       3.920   Cond. No.                     1.54e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
np.log10(num_processed)                            -3.145747
C(Protocol2)[T.BGISEQ-500 K562 Sequencing rep1]    -1.082144
C(Protocol2)[T.Hiseq4000 K562 Sequencing rep1]     -1.023011
C(Protocol2)[T.Hiseq4000 ESC Sequencing rep1]       0.059639
np.power(np.log10(num_processed), 2)                0.243209
Intercept                                          11.804248
dtype: float64
read_satureation: 2932078
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py:3477: UserWarning:Attempted to set non-positive ylimits for log-scale axis; invalid limits will be ignored.
In [6]:
ax = fit_sensitivity(df, fun = 'accuracy_ERCC',key1="accuracy_ERCC",key2="num_processed",key3='Protocol2', \
                     xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
                     ylim = [0, 1], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Accuracy', titlesize = 20, names = names, colors = colors)
accuracy_ERCC ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          accuracy_ERCC   R-squared:                       0.393
Model:                            OLS   Adj. R-squared:                  0.384
Method:                 Least Squares   F-statistic:                     44.20
Date:                Sat, 01 Dec 2018   Prob (F-statistic):           4.05e-35
Time:                        17:33:18   Log-Likelihood:                 546.46
No. Observations:                 348   AIC:                            -1081.
Df Residuals:                     342   BIC:                            -1058.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
===================================================================================================================
                                                      coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept                                          -0.1358      0.142     -0.959      0.338      -0.414       0.143
C(Protocol2)[T.BGISEQ-500 K562 Sequencing rep1]     0.0395      0.008      4.927      0.000       0.024       0.055
C(Protocol2)[T.Hiseq4000 ESC Sequencing rep1]       0.0280      0.009      3.101      0.002       0.010       0.046
C(Protocol2)[T.Hiseq4000 K562 Sequencing rep1]      0.0261      0.008      3.264      0.001       0.010       0.042
np.power(np.log10(num_processed), 2)               -0.0302      0.006     -5.233      0.000      -0.042      -0.019
np.log10(num_processed)                             0.3455      0.058      5.974      0.000       0.232       0.459
==============================================================================
Omnibus:                       23.926   Durbin-Watson:                   1.839
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.946
Skew:                          -0.482   Prob(JB):                     1.56e-08
Kurtosis:                       4.244   Cond. No.                     1.54e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept                                         -0.135786
np.power(np.log10(num_processed), 2)              -0.030211
C(Protocol2)[T.Hiseq4000 K562 Sequencing rep1]     0.026136
C(Protocol2)[T.Hiseq4000 ESC Sequencing rep1]      0.028049
C(Protocol2)[T.BGISEQ-500 K562 Sequencing rep1]    0.039458
np.log10(num_processed)                            0.345463
dtype: float64
read_satureation: 521794
In [7]:
ax = fit_sensitivity(df, fun = 'accuracy_ERCC',key1="accuracy_ERCC",key2="num_processed",key3='Protocol2', \
                     xlabel = 'Sequenced reads per single-well',\
                     ylabel = 'Accuracy \n (pearson correlation \n coefficient)', \
                     ylim = [0, 1], xlim = [500, 1e8], \
                     xlabelsize = 18, ylabelsize = 18, \
                     title = 'Accuracy', titlesize = 20, names = names, colors = colors, colordots=True)
accuracy_ERCC ~ np.power(np.log10(num_processed), 2) + np.log10(num_processed) + C(Protocol2) + 1
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          accuracy_ERCC   R-squared:                       0.393
Model:                            OLS   Adj. R-squared:                  0.384
Method:                 Least Squares   F-statistic:                     44.20
Date:                Sat, 01 Dec 2018   Prob (F-statistic):           4.05e-35
Time:                        17:34:24   Log-Likelihood:                 546.46
No. Observations:                 348   AIC:                            -1081.
Df Residuals:                     342   BIC:                            -1058.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
===================================================================================================================
                                                      coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept                                          -0.1358      0.142     -0.959      0.338      -0.414       0.143
C(Protocol2)[T.BGISEQ-500 K562 Sequencing rep1]     0.0395      0.008      4.927      0.000       0.024       0.055
C(Protocol2)[T.Hiseq4000 ESC Sequencing rep1]       0.0280      0.009      3.101      0.002       0.010       0.046
C(Protocol2)[T.Hiseq4000 K562 Sequencing rep1]      0.0261      0.008      3.264      0.001       0.010       0.042
np.power(np.log10(num_processed), 2)               -0.0302      0.006     -5.233      0.000      -0.042      -0.019
np.log10(num_processed)                             0.3455      0.058      5.974      0.000       0.232       0.459
==============================================================================
Omnibus:                       23.926   Durbin-Watson:                   1.839
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.946
Skew:                          -0.482   Prob(JB):                     1.56e-08
Kurtosis:                       4.244   Cond. No.                     1.54e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Intercept                                         -0.135786
np.power(np.log10(num_processed), 2)              -0.030211
C(Protocol2)[T.Hiseq4000 K562 Sequencing rep1]     0.026136
C(Protocol2)[T.Hiseq4000 ESC Sequencing rep1]      0.028049
C(Protocol2)[T.BGISEQ-500 K562 Sequencing rep1]    0.039458
np.log10(num_processed)                            0.345463
dtype: float64
read_satureation: 521794

Pseudobulk

In [165]:
dfs = []
for f in ['bgi','sanger']:
    df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
    dfs.append(df)
df = pd.concat(dfs, axis=0)
In [166]:
df.shape
Out[166]:
(632, 34866)
In [167]:
df['Protocol2'].value_counts()
Out[167]:
BGISEQ-500 SMARTer                 96
HiSeq-2000 Smart-Seq2 rep1         96
HiSeq-2000 SMARTer                 96
BGISEQ-500 Smart-Seq2 rep1         95
BGISEQ-500 Smart-Seq2 rep2         72
HiSeq-2000 Smart-Seq2 rep2         61
BGISEQ-500 K562 Seuqencing rep1    37
Hiseq4000 K562 Sequencing rep1     37
Hiseq4000 ESC Sequencing rep1      21
BGISEQ-500 ESC Sequencing rep1     21
Name: Protocol2, dtype: int64
In [21]:
df1 = df[df.columns[df.columns.str.startswith('ENS')|df.columns.str.startswith('Protocol2')]]
df1 = df1.groupby('Protocol2').sum()
In [23]:
df1.to_csv("../pdata1/pseudobulk.csv")
In [91]:
df1 = pd.read_csv("../pdata1/pseudobulk.csv", index_col=0)
In [92]:
df1.head(2)
Out[92]:
ENSMUSG00000107196.1 ENSMUSG00000106901.1 ENSMUSG00000107077.1 ENSMUSG00000097549.2 ENSMUSG00000104359.1 ENSMUSG00000107088.1 ENSMUSG00000111721.1 ENSMUSG00000111588.1 ENSMUSG00000107210.1 ENSMUSG00000097359.2 ... ENSMUSG00000083997.1 ENSMUSG00000111337.1 ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1
Protocol2
BGISEQ-500 ESC Sequencing rep1 0.0 0.0 0.0000 0.0 0.0 0.0 0.0 0.0 0.0 1.5 ... 0.00000 1.00150 0.0 0.0 3.000 93910.630000 3727.07596 435.45573 8266.4670 8.00000
BGISEQ-500 K562 Seuqencing rep1 0.5 0.0 14.0127 0.0 0.0 0.0 0.5 0.0 0.0 34.5 ... 6.11582 40.14731 1.0 4.0 56.755 903.504673 68.28600 154.03886 4619.9553 1.04841

2 rows × 34839 columns

In [170]:
df1.index
Out[170]:
Index(['BGISEQ-500 ESC Sequencing rep1', 'BGISEQ-500 K562 Seuqencing rep1',
       'BGISEQ-500 SMARTer', 'BGISEQ-500 Smart-Seq2 rep1',
       'BGISEQ-500 Smart-Seq2 rep2', 'HiSeq-2000 SMARTer',
       'HiSeq-2000 Smart-Seq2 rep1', 'HiSeq-2000 Smart-Seq2 rep2',
       'Hiseq4000 ESC Sequencing rep1', 'Hiseq4000 K562 Sequencing rep1'],
      dtype='object', name='Protocol2')
In [171]:
l1 = df1.index[df1.index.str.startswith('BGI')].tolist()
l2 = df1.index[df1.index.str.startswith('Hi')]
In [54]:
l1 = ['BGISEQ-500 ESC Sequencing rep1',
 'BGISEQ-500 K562 Seuqencing rep1',
 'BGISEQ-500 SMARTer',
 'BGISEQ-500 Smart-Seq2 rep1',
 'BGISEQ-500 Smart-Seq2 rep2']
In [55]:
l2 = ['Hiseq4000 ESC Sequencing rep1',
 'Hiseq4000 K562 Sequencing rep1',
 'HiSeq-2000 SMARTer',
 'HiSeq-2000 Smart-Seq2 rep1',
 'HiSeq-2000 Smart-Seq2 rep2']
In [56]:
names = []
corrs = []
for i,j in zip(l1,l2):
    cor = df1.loc[i].corr(df1.loc[j], method='pearson')
    corrs.append(cor)
    names.append(i.replace('BGISEQ-500 ',''))
In [57]:
df = pd.DataFrame({'name':names,'corr':corrs})

This is the pseudobulk table.

detection_limit and accuracies can only be table like this.

the calculation of detection_limit and accuracies need some changes in the analysis workflow.

In [58]:
df
Out[58]:
corr name
0 0.979305 ESC Sequencing rep1
1 0.976184 K562 Seuqencing rep1
2 0.976761 SMARTer
3 0.966934 Smart-Seq2 rep1
4 0.933008 Smart-Seq2 rep2
In [93]:
df = pd.DataFrame(numpy.corrcoef(df1), index = df1.index, columns = df1.index)

This is an all-to-all correlation table.

show a heatmap

In [94]:
df
Out[94]:
Protocol2 BGISEQ-500 ESC Sequencing rep1 BGISEQ-500 K562 Seuqencing rep1 BGISEQ-500 SMARTer BGISEQ-500 Smart-Seq2 rep1 BGISEQ-500 Smart-Seq2 rep2 HiSeq-2000 SMARTer HiSeq-2000 Smart-Seq2 rep1 HiSeq-2000 Smart-Seq2 rep2 Hiseq4000 ESC Sequencing rep1 Hiseq4000 K562 Sequencing rep1
Protocol2
BGISEQ-500 ESC Sequencing rep1 1.000000 0.260707 0.675334 0.699834 0.701041 0.648285 0.695722 0.705250 0.979305 0.232673
BGISEQ-500 K562 Seuqencing rep1 0.260707 1.000000 0.310025 0.369512 0.355432 0.282112 0.374829 0.368372 0.282363 0.976184
BGISEQ-500 SMARTer 0.675334 0.310025 1.000000 0.833723 0.806252 0.976761 0.864373 0.888879 0.673151 0.271648
BGISEQ-500 Smart-Seq2 rep1 0.699834 0.369512 0.833723 1.000000 0.975974 0.764209 0.966934 0.915020 0.739477 0.334754
BGISEQ-500 Smart-Seq2 rep2 0.701041 0.355432 0.806252 0.975974 1.000000 0.747264 0.951291 0.933008 0.745700 0.322431
HiSeq-2000 SMARTer 0.648285 0.282112 0.976761 0.764209 0.747264 1.000000 0.829851 0.889031 0.629616 0.247128
HiSeq-2000 Smart-Seq2 rep1 0.695722 0.374829 0.864373 0.966934 0.951291 0.829851 1.000000 0.948526 0.730025 0.338473
HiSeq-2000 Smart-Seq2 rep2 0.705250 0.368372 0.888879 0.915020 0.933008 0.889031 0.948526 1.000000 0.713776 0.332444
Hiseq4000 ESC Sequencing rep1 0.979305 0.282363 0.673151 0.739477 0.745700 0.629616 0.730025 0.713776 1.000000 0.253058
Hiseq4000 K562 Sequencing rep1 0.232673 0.976184 0.271648 0.334754 0.322431 0.247128 0.338473 0.332444 0.253058 1.000000
In [47]:
df.index
Out[47]:
Index(['BGISEQ-500 ESC Sequencing rep1', 'BGISEQ-500 K562 Seuqencing rep1',
       'BGISEQ-500 SMARTer', 'BGISEQ-500 Smart-Seq2 rep1',
       'BGISEQ-500 Smart-Seq2 rep2', 'HiSeq-2000 SMARTer',
       'HiSeq-2000 Smart-Seq2 rep1', 'HiSeq-2000 Smart-Seq2 rep2',
       'Hiseq4000 ESC Sequencing rep1', 'Hiseq4000 K562 Sequencing rep1'],
      dtype='object', name='Protocol2')
In [185]:
cols =['BGISEQ-500 SMARTer', 'HiSeq-2000 SMARTer','BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1',
       'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2','BGISEQ-500 ESC Sequencing rep1', 
       'Hiseq4000 ESC Sequencing rep1', 'BGISEQ-500 K562 Seuqencing rep1',
       'Hiseq4000 K562 Sequencing rep1']
In [95]:
cols =['BGISEQ-500 SMARTer', 'HiSeq-2000 SMARTer','BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1',
       'BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2']
In [96]:
df = df[cols]
In [97]:
df = df.loc[cols]

[UK only plot here:]

In [99]:
figsize(6,6)
fig, ax = plt.subplots()
im = ax.imshow(df.values)

# We want to show all ticks...
ax.set_xticks(np.arange(df.shape[0]))
ax.set_yticks(np.arange(df.shape[0]))
# ... and label them with the respective list entries
ax.set_xticklabels(df.index)
ax.set_yticklabels(df.index)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(df.shape[0]):
    for j in range(df.shape[0]):
        text = ax.text(j, i, '%.2f'%df.iloc[i, j],
                       ha="center", va="center", color="w")

# ax.set_title("Harvest of local farmers (in tons/year)")
fig.tight_layout()
# plt.show()
plt.savefig("Heatmap.pdf")
In [ ]:
 
In [188]:
figsize(6,6)
fig, ax = plt.subplots()
im = ax.imshow(df.values)

# We want to show all ticks...
ax.set_xticks(np.arange(df.shape[0]))
ax.set_yticks(np.arange(df.shape[0]))
# ... and label them with the respective list entries
ax.set_xticklabels(df.index)
ax.set_yticklabels(df.index)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(df.shape[0]):
    for j in range(df.shape[0]):
        text = ax.text(j, i, '%.2f'%df.iloc[i, j],
                       ha="center", va="center", color="w")

# ax.set_title("Harvest of local farmers (in tons/year)")
fig.tight_layout()
plt.show()
In [ ]:
 
In [ ]:
 
In [ ]:
 

Venn plot

In [81]:
from matplotlib_venn import venn3
In [129]:
df = pd.read_csv("../pdata1/pseudobulk.csv", index_col=0)
In [83]:
df
Out[83]:
ENSMUSG00000107196.1 ENSMUSG00000106901.1 ENSMUSG00000107077.1 ENSMUSG00000097549.2 ENSMUSG00000104359.1 ENSMUSG00000107088.1 ENSMUSG00000111721.1 ENSMUSG00000111588.1 ENSMUSG00000107210.1 ENSMUSG00000097359.2 ... ENSMUSG00000083997.1 ENSMUSG00000111337.1 ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1
Protocol2
BGISEQ-500 ESC Sequencing rep1 0.000000 0.000000 0.000000 0.0 0.00000 0.0 0.0 0.0 0.00000 1.5 ... 0.00000 1.00150 0.0 0.0 3.000000 93910.630000 3727.07596 435.455730 8266.467000 8.00000
BGISEQ-500 K562 Seuqencing rep1 0.500000 0.000000 14.012700 0.0 0.00000 0.0 0.5 0.0 0.00000 34.5 ... 6.11582 40.14731 1.0 4.0 56.755000 903.504673 68.28600 154.038860 4619.955300 1.04841
BGISEQ-500 SMARTer 0.000030 0.499806 4.350594 0.0 0.00000 49.0 0.0 4.5 2.50142 2.0 ... 42.04011 24.06719 2.0 0.0 15.000000 5876.940684 7344.42980 89.347747 25920.176150 5.00000
BGISEQ-500 Smart-Seq2 rep1 0.500000 0.000000 1.000000 0.0 0.00000 6.5 0.5 21.0 2.98755 0.5 ... 1.00317 12.03787 0.0 0.0 13.677450 112372.769161 7700.67955 195.736000 31699.557991 3.00100
BGISEQ-500 Smart-Seq2 rep2 0.500000 0.000000 0.000000 0.0 16.70700 0.0 0.5 56.5 0.00000 0.0 ... 33.01779 2.00152 0.0 1.0 6.000000 395535.680900 7341.01971 453.187161 41082.168500 0.00000
HiSeq-2000 SMARTer 0.332127 0.000000 1.681241 0.0 0.00000 13.0 0.0 0.5 1.67056 0.5 ... 15.00129 3.00726 0.0 0.0 4.000000 197.692855 2872.91187 10.559021 6365.110330 2.00000
HiSeq-2000 Smart-Seq2 rep1 0.000000 0.500000 0.666667 0.0 0.00000 5.5 0.0 5.0 1.12381 0.0 ... 3.00107 4.39114 0.0 0.0 4.000000 2911.986450 2988.47902 37.819600 8361.767588 1.00000
HiSeq-2000 Smart-Seq2 rep2 0.000000 0.000000 0.176734 0.0 3.38753 0.0 0.0 2.5 0.00000 0.0 ... 1.00000 2.00328 0.0 0.0 1.000000 1792.305916 2115.65041 47.958863 5248.961560 0.00000
Hiseq4000 ESC Sequencing rep1 0.000000 0.000000 0.000000 0.0 0.00000 0.0 0.0 0.0 0.00000 1.5 ... 1.00173 0.00000 0.0 0.0 4.000000 59157.470000 1528.69916 121.437330 6019.610000 1.00000
Hiseq4000 K562 Sequencing rep1 0.000000 0.000000 4.333330 0.0 0.00000 0.0 0.0 0.0 0.00000 3.0 ... 3.08179 3.00126 0.0 1.0 43.603843 529.447774 14.00000 32.433880 2304.653200 0.00000

10 rows × 34839 columns

In [43]:
df.shape
Out[43]:
(10, 34839)
In [84]:
from matplotlib_venn import venn2
In [83]:
i=0
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
BGISEQ-500 ESC Sequencing rep1
Out[83]:
<matplotlib_venn._common.VennDiagram at 0x1c343adb00>
In [84]:
i=1
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
BGISEQ-500 K562 Seuqencing rep1
Out[84]:
<matplotlib_venn._common.VennDiagram at 0x1c39491780>

[PR4]

83A overlap with 84A. 83B overlap with 84B. overlap in 83 and overlap in 84.

In [85]:
# print(df.index[i])
name1 = 'BGISEQ-500 K562 Seuqencing rep1'
name2 = 'BGISEQ-500 ESC Sequencing rep1'
venn2([set(df.loc[name1][df.loc[name1]>10].index), set(df.loc[name2][df.loc[name2]>10].index)],set_labels=(name1,name2))
Out[85]:
<matplotlib_venn._common.VennDiagram at 0x1c2e13a198>
In [87]:
# print(df.index[i])
name1 = 'Hiseq4000 K562 Sequencing rep1'
name2 = 'Hiseq4000 ESC Sequencing rep1'
venn2([set(df.loc[name1][df.loc[name1]>10].index), set(df.loc[name2][df.loc[name2]>10].index)],set_labels=(name1,name2))
Out[87]:
<matplotlib_venn._common.VennDiagram at 0x1c305bf2e8>
In [130]:
name1 = 'Hiseq4000 K562 Sequencing rep1'
name2 = 'Hiseq4000 ESC Sequencing rep1'
x1 = df.loc[name1][df.loc[name1]>10].index
x2 = df.loc[name2][df.loc[name2]>10].index
name1 = 'BGISEQ-500 K562 Seuqencing rep1'
name2 = 'BGISEQ-500 ESC Sequencing rep1'
x3 = df.loc[name1][df.loc[name1]>10].index
x4 = df.loc[name2][df.loc[name2]>10].index
In [ ]:
 
In [131]:
overlap = x3[x3.isin(x4)]
In [134]:
pd.Series(overlap).to_csv("overlap.csv")
In [135]:
non1 = x3[~x3.isin(x4)]
In [136]:
pd.Series(non1).to_csv("k562only_genes.csv")
In [137]:
non2 = x4[~x4.isin(x3)]
In [138]:
pd.Series(non2).to_csv("ESConly_genes.csv")
In [ ]:
 
In [ ]:
 
In [89]:
x5 = x3[x3.isin(x1)]
x6 = x4[x4.isin(x2)]
In [90]:
venn2([set(x5), set(x6)],set_labels=('K562','ESC'))
Out[90]:
<matplotlib_venn._common.VennDiagram at 0x1c3c82b438>
In [ ]:
 
In [85]:
i=2
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
BGISEQ-500 SMARTer
Out[85]:
<matplotlib_venn._common.VennDiagram at 0x1c37663e48>
In [86]:
i=3
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
BGISEQ-500 Smart-Seq2 rep1
Out[86]:
<matplotlib_venn._common.VennDiagram at 0x1c35c77f60>
In [87]:
i=4
print(df.index[i])
venn2([set(df.loc[l1[i]][df.loc[l1[i]]>0].index), set(df.loc[l2[i]][df.loc[l2[i]]>0].index)])
BGISEQ-500 Smart-Seq2 rep2
Out[87]:
<matplotlib_venn._common.VennDiagram at 0x1c35c9ccc0>

CV^2 vs. Mean

In [7]:
dfs = []
for f in ['bgi','sanger']:
    df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
    dfs.append(df)
df = pd.concat(dfs, axis=0)

df1 = df[df.columns[df.columns.str.startswith('ENS')|df.columns.str.startswith('Protocol2')]]
In [8]:
df = df1.T
In [9]:
names = df.loc['Protocol2'].unique()
In [10]:
for s in df.loc['Protocol2'].unique():
#     print(s)
    df['%s_mean'%s] = np.mean(df.loc[:,df.loc['Protocol2'] == s].iloc[:-1,:],1)
    df['%s_var'%s] = np.var(df.loc[:,df.loc['Protocol2'] == s].iloc[:-1,:],1)
    df['%s_drop'%s] = np.sum((df.loc[:,df.loc['Protocol2'] == s].iloc[:-1,:]==0),1) \
    / df.loc[:,df.loc['Protocol2'] == s].shape[1]
    df['%s_cv2'%s] = df['%s_var'%s]/(df['%s_mean'%s]**2)
df = df.iloc[:-1,:]
In [11]:
figsize(5,5)
plt.loglog()
for s in names:
    plt.scatter(df['%s_mean'%s], df['%s_var'%s], s = 0.1, label = s)
plt.xlim(1e-6, 1e5)
plt.ylim(1e-9, 1e10);
plt.legend(bbox_to_anchor=(1.05, 1))
plt.xlabel("Means")
plt.ylabel("variances")
Out[11]:
Text(0,0.5,'variances')

[PR1] split into 5 paired ones.

In [12]:
names
Out[12]:
array(['BGISEQ-500 Smart-Seq2 rep2', 'BGISEQ-500 Smart-Seq2 rep1',
       'BGISEQ-500 SMARTer', 'BGISEQ-500 K562 Seuqencing rep1',
       'BGISEQ-500 ESC Sequencing rep1', 'Hiseq4000 K562 Sequencing rep1',
       'Hiseq4000 ESC Sequencing rep1', 'HiSeq-2000 SMARTer',
       'HiSeq-2000 Smart-Seq2 rep1', 'HiSeq-2000 Smart-Seq2 rep2'],
      dtype=object)
In [13]:
namet = [['BGISEQ-500 Smart-Seq2 rep2','HiSeq-2000 Smart-Seq2 rep2'],
         ['BGISEQ-500 Smart-Seq2 rep1','HiSeq-2000 Smart-Seq2 rep1'],
         ['BGISEQ-500 SMARTer','HiSeq-2000 SMARTer'],
         ['BGISEQ-500 K562 Seuqencing rep1','Hiseq4000 K562 Sequencing rep1',],
         ['BGISEQ-500 ESC Sequencing rep1','Hiseq4000 ESC Sequencing rep1']]
titles = ['Smart-Seq2 rep2',
          'Smart-Seq2 rep1',
          'SMARTer',
          'K562',
          'ESC']
In [15]:
figsize(5,5)
for tt, nameu in zip(titles,namet):
    plt.clf()
    plt.loglog()
    for s in nameu:
        plt.scatter(df['%s_mean'%s], df['%s_cv2'%s], s = 0.1, label = s)
    plt.xlim(1e-6, 1e5)
    plt.ylim(1e-2, 1e2);
    plt.legend(bbox_to_anchor=(1.05, 1))
    plt.title(tt)
    plt.xlabel("Means")
    plt.ylabel("$CV^2$")
    plt.show()

Dropout

[PR2] split into 5 as well.

In [17]:
df.head()
Out[17]:
CL100029125_L01_12 CL100029125_L01_19 CL100029125_L01_20 CL100029125_L01_21 CL100029125_L02_23 CL100029125_L01_45 CL100029125_L02_25 CL100029125_L02_19 CL100029125_L02_37 CL100029125_L01_9 ... HiSeq-2000 SMARTer_drop HiSeq-2000 SMARTer_cv2 HiSeq-2000 Smart-Seq2 rep1_mean HiSeq-2000 Smart-Seq2 rep1_var HiSeq-2000 Smart-Seq2 rep1_drop HiSeq-2000 Smart-Seq2 rep1_cv2 HiSeq-2000 Smart-Seq2 rep2_mean HiSeq-2000 Smart-Seq2 rep2_var HiSeq-2000 Smart-Seq2 rep2_drop HiSeq-2000 Smart-Seq2 rep2_cv2
ENSMUSG00000107196.1 0 0 0 0 0 0 0 0 0 0 ... 0.979167 94.999492 0.000000 0.000000 1.000000 NaN 0.000000 0.000000 1.000000 NaN
ENSMUSG00000106901.1 0 0 0 0 0 0 0 0 0 0 ... 1.000000 NaN 0.005208 0.002577 0.989583 95.0 0.000000 0.000000 1.000000 NaN
ENSMUSG00000107077.1 0 0 0 0 0 0 0 0 0 0 ... 0.968750 63.134037 0.006944 0.004581 0.989583 95.0 0.002897 0.000504 0.983607 60.0
ENSMUSG00000097549.2 0 0 0 0 0 0 0 0 0 0 ... 1.000000 NaN 0.000000 0.000000 1.000000 NaN 0.000000 0.000000 1.000000 NaN
ENSMUSG00000104359.1 0 0 0 0 0 0 0 0 0 0 ... 1.000000 NaN 0.000000 0.000000 1.000000 NaN 0.055533 0.185037 0.983607 60.0

5 rows × 672 columns

In [18]:
for tt, nameu in zip(titles,namet):
    plt.clf()
    plt.xscale('log')
    for s in nameu:
        plt.scatter(df['%s_mean'%s], df['%s_drop'%s], s = 0.1, label = s)
    plt.xlim(1e-5, 1e5)
    plt.ylim(-0.1, 1.1);
    plt.legend(bbox_to_anchor=(1.05, 1))
    plt.xlabel("Means")
    plt.ylabel("Dropout rate")
    plt.show()

Heatmap

In [224]:
dfs = []
for f in ['bgi','sanger']:
    df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
    dfs.append(df)
df = pd.concat(dfs, axis=0)
In [225]:
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
In [102]:
dm = pd.read_csv("../data/GRCm38.cdna.all.symbol.tsv",index_col=1, sep=' ',header=None)
dn = pd.read_csv("../data/PlurinetGenes_Category.txt", sep = '\t', index_col=0)
In [103]:
dm = dm.loc[dn.index]
dm['funct_cat'] = dn['funct_cat']
dm = dm.dropna()
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
In [228]:
df = df[df.columns[df.columns.isin(dm[0])|df.columns.isin(['Protocol2','n_genes'])].tolist()]
In [104]:
dm['gene'] = dm.index
dm.index = dm[0]
In [105]:
dm.head()
Out[105]:
0 funct_cat gene
0
ENSMUSG00000022346.15 ENSMUSG00000022346.15 Cell cycle Myc
ENSMUSG00000006398.15 ENSMUSG00000006398.15 Cell cycle Cdc20
ENSMUSG00000072082.6 ENSMUSG00000072082.6 Cell cycle Ccnf
ENSMUSG00000027379.13 ENSMUSG00000027379.13 Cell cycle Bub1
ENSMUSG00000021611.9 ENSMUSG00000021611.9 Cell cycle Tert
In [232]:
df.head()
Out[232]:
n_genes ENSMUSG00000006920.14 ENSMUSG00000021069.17 ENSMUSG00000022528.7 ENSMUSG00000013089.15 ENSMUSG00000057666.18 ENSMUSG00000026380.10 ENSMUSG00000021196.14 ENSMUSG00000029687.16 ENSMUSG00000026398.14 ... ENSMUSG00000037171.6 ENSMUSG00000022346.15 ENSMUSG00000006398.15 ENSMUSG00000031665.7 ENSMUSG00000025809.15 ENSMUSG00000000730.13 ENSMUSG00000021835.14 ENSMUSG00000042745.9 ENSMUSG00000053113.3 Protocol2
CL100029125_L01_12 8422.0 794.932 0.0 284.0 1599.0 43340.10 6157.41 6812.700 3269.11 276.000 ... 1.0 99.9999 5932.00 419.0 885.0 2754.990 1230.0 260.0 97.0 BGISEQ-500 Smart-Seq2 rep2
CL100029125_L01_19 7628.0 853.000 747.0 42.0 153.0 24974.50 5002.86 1603.110 343.00 0.000 ... 0.0 0.0000 1.00 127.0 2453.0 72.000 238.0 0.0 0.0 BGISEQ-500 Smart-Seq2 rep2
CL100029125_L01_20 7319.0 0.000 0.0 9.0 155.0 4086.07 1503.00 248.791 237.00 62.000 ... 184.0 0.0000 107.00 42.0 253.0 107.000 1.0 0.0 27.0 BGISEQ-500 Smart-Seq2 rep2
CL100029125_L01_21 7578.0 592.000 0.0 281.0 1314.0 42617.90 3642.98 5474.120 2090.00 329.001 ... 0.0 0.0000 6336.98 201.0 1987.0 673.999 897.0 0.0 191.0 BGISEQ-500 Smart-Seq2 rep2
CL100029125_L02_23 7746.0 174.000 0.0 74.0 44.0 13201.00 1236.00 3236.400 1569.00 255.000 ... 8.0 0.0000 275.00 580.0 95.0 0.000 35.0 228.0 0.0 BGISEQ-500 Smart-Seq2 rep2

5 rows × 104 columns

In [233]:
df.columns = ['n_genes']+dm.loc[df.columns[1:-1]]['gene'].tolist()+['Protocol2']
In [234]:
df.to_csv("../pdata1/Heatmap.csv")
In [106]:
dm.funct_cat.value_counts()
Out[106]:
Pluripotency                27
Signaling                   19
Chromatin                   17
Metabolism                  14
Differentiation             10
Cell cycle                   5
Surface marker               4
Totipotent subpop marker     3
Housekeeping                 2
Germ cell                    1
Name: funct_cat, dtype: int64

Differentiation

In [100]:
df = pd.read_csv("../pdata1/Heatmap.csv", index_col=0)
In [107]:
dm[dm.funct_cat == 'Differentiation']['gene'].tolist() + ['Protocol2','n_genes']
Out[107]:
['T',
 'Gata3',
 'Sox1',
 'Sox7',
 'Eomes',
 'Pax6',
 'Gata6',
 'Gata4',
 'Sox17',
 'Cdx2',
 'Protocol2',
 'n_genes']
In [237]:
df.Protocol2.value_counts()
Out[237]:
HiSeq-2000 SMARTer                 96
BGISEQ-500 SMARTer                 96
HiSeq-2000 Smart-Seq2 rep1         96
BGISEQ-500 Smart-Seq2 rep1         95
BGISEQ-500 Smart-Seq2 rep2         72
HiSeq-2000 Smart-Seq2 rep2         61
Hiseq4000 K562 Sequencing rep1     37
BGISEQ-500 K562 Sequencing rep1    37
Hiseq4000 ESC Sequencing rep1      21
BGISEQ-500 ESC Sequencing rep1     21
Name: Protocol2, dtype: int64
In [108]:
df = df[dm[dm.funct_cat == 'Differentiation']['gene'].tolist() + ['n_genes','Protocol2']]

left side is BGI

In [127]:
def make_heatmap(df, key = 'HiSeq-2000 SMARTer', cut=5000,title=''):
    x = df[df.Protocol2.str.endswith(key)]
    z = x.iloc[:,:-1]
    z = z.sort_values(by='n_genes')
    y = z.iloc[:,:-1]
    plt.clf()
    plt.pcolormesh(np.log1p(y.T),cmap='RdYlBu_r')
    plt.ylabel('Genes', fontsize= 24)
    plt.xlabel('Cells', fontsize= 24)
    plt.axvline((x['n_genes']<cut).sum(), lw=2, c='r',linestyle='-')
    plt.gca().set_yticks(np.arange(y.shape[1])+0.5, minor=True)
    plt.gca().set_yticks([], minor=False)
    plt.gca().set_yticklabels(x.columns[:-2], minor=True)
    plt.gca().set_xticks([], minor=False)
    plt.title(title)
    plt.show()

red line is the cutoff of number of genes. 7500 for BGI and 5000 for Uk.

In [122]:
x = df[df.Protocol2.str.endswith('BGISEQ-500 ESC Sequencing rep1')]
In [123]:
z = x.sort_values(by='n_genes')
In [ ]:
 
In [128]:
figsize(5,3)
for k in df.Protocol2.unique():
    cut = 5000
    if k.find('BGI')>-1:cut=7500
    make_heatmap(df, key = k, cut=cut,title=k)
In [ ]:
 

Pluripotency

In [270]:
df = pd.read_csv("../pdata1/Heatmap.csv", index_col=0)
In [271]:
dm.funct_cat.value_counts()
Out[271]:
Pluripotency                27
Signaling                   19
Chromatin                   17
Metabolism                  14
Differentiation             10
Cell cycle                   5
Surface marker               4
Totipotent subpop marker     3
Housekeeping                 2
Germ cell                    1
Name: funct_cat, dtype: int64
In [272]:
df.Protocol2.value_counts()
Out[272]:
HiSeq-2000 SMARTer                 96
BGISEQ-500 SMARTer                 96
HiSeq-2000 Smart-Seq2 rep1         96
BGISEQ-500 Smart-Seq2 rep1         95
BGISEQ-500 Smart-Seq2 rep2         72
HiSeq-2000 Smart-Seq2 rep2         61
Hiseq4000 K562 Sequencing rep1     37
BGISEQ-500 K562 Sequencing rep1    37
Hiseq4000 ESC Sequencing rep1      21
BGISEQ-500 ESC Sequencing rep1     21
Name: Protocol2, dtype: int64
In [273]:
df = df[dm[dm.funct_cat == 'Pluripotency']['gene'].tolist() + ['n_genes','Protocol2']]
In [275]:
figsize(5,5)
for k in df.Protocol2.unique():
    cut = 5000
    if k.find('BGI')>0:cut=7500
    make_heatmap(df, key = k, cut=cut,title=k)
In [ ]:
 
In [240]:
figsize(5,5)
make_heatmap(df, key = 'Smart-Seq2 rep1')
In [241]:
figsize(5,5)
make_heatmap(df, key = 'Smart-Seq2 rep2')
In [242]:
figsize(5,5)
make_heatmap(df, key = 'SMARTer')
In [243]:
figsize(5,5)
make_heatmap(df, key = 'K562 Sequencing rep1')
In [244]:
figsize(5,5)
make_heatmap(df, key = 'ESC Sequencing rep1')

Non-overlapping genes

In [34]:
dfs = []
for f in ['bgi','sanger']:
    df = pd.read_csv("../pdata1/%s.csv"%f, index_col=0)
    dfs.append(df)
df = pd.concat(dfs, axis=0)
In [35]:
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
In [46]:
df1 = pd.read_csv("../pdata1/pseudobulk.csv", index_col=0)
In [36]:
key = 'Smart-Seq2 rep1'

df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
     df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
In [37]:
key = 'Smart-Seq2 rep2'

df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
     df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
In [38]:
key = 'SMARTer'

df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
     df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
In [47]:
df1.index = df1.index.str.replace('Seuqencing','Sequencing')
In [48]:
key = 'K562 Sequencing rep1'

df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
     df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)
In [41]:
key = 'ESC Sequencing rep1'

df2 = df1[df1.index.str.endswith(key)]
gl = df2.columns[((df2.iloc[0]>0)&(df2.iloc[1]==0))].tolist()+ \
     df2.columns[((df2.iloc[1]>0)&(df2.iloc[0]==0))].tolist()
df3 = df[gl + ['Protocol2']]
df2 = df3[df3.Protocol2.str.endswith(key)]
df2 = df2.iloc[:,list(arange(100))+list(arange(-1, -102, -1))[::-1]]
figsize(10,10)
make_heatmap(df2, key = key)

Insert size

In [2]:
df1 = pd.read_csv("../data/fLen_bgi.csv", index_col=0)
df2 = pd.read_csv("../data/fLen_sanger.csv", index_col=0)

df = pd.concat([df1,df2])

df.index = df.index.str.split('/').str[-1]
In [3]:
df.head()
Out[3]:
0 1 2 3 4 5 6 7 8 9 ... 991 992 993 994 995 996 997 998 999 1000
CL100029124_L01_12 8.284890e-31 9.816330e-06 1.833900e-05 1.972370e-05 1.108990e-05 2.517170e-06 1.071380e-06 4.285540e-06 6.428310e-06 4.285540e-06 ... 1.062910e-07 4.251650e-07 6.819520e-07 6.019870e-07 3.715240e-07 1.768220e-07 4.420550e-08 0.000043 0.000174 0.000260
CL100029124_L01_28 8.870040e-31 3.966090e-07 2.644060e-07 6.610140e-08 4.336570e-30 6.422650e-30 9.497030e-30 1.402060e-29 2.066570e-29 3.041160e-29 ... 2.979950e-06 1.986640e-06 4.966590e-07 1.073620e-17 1.073620e-17 1.073620e-17 1.073620e-17 0.000031 0.000123 0.000184
CL100029124_L01_38 8.139690e-31 1.161480e-05 7.743190e-06 1.935800e-06 3.979500e-30 5.893820e-30 8.715060e-30 4.557650e-07 1.823060e-06 2.734590e-06 ... 9.852150e-18 9.852150e-18 9.852150e-18 9.852150e-18 9.852150e-18 9.852150e-18 9.852150e-18 0.000017 0.000068 0.000102
CL100029124_L01_45 7.785850e-31 2.643520e-06 3.689320e-06 2.367560e-06 5.780930e-07 5.637610e-30 8.336200e-30 1.230680e-29 1.813970e-29 4.154280e-08 ... 4.835470e-06 1.208870e-06 9.423870e-18 9.423870e-18 9.423870e-18 9.423870e-18 9.423870e-18 0.000023 0.000092 0.000138
CL100029124_L02_32 6.973910e-31 6.799030e-07 1.610290e-06 3.041640e-06 3.712580e-06 2.361730e-06 9.809230e-07 3.749670e-06 1.109380e-05 1.472550e-05 ... 5.379660e-06 8.069480e-06 5.379660e-06 1.344910e-06 8.436460e-18 8.436460e-18 8.436460e-18 0.000039 0.000157 0.000235

5 rows × 1001 columns

In [8]:
df['Protocol2'] = phn.loc[df.index]['Protocol2']
df = df[~df.Protocol2.isna()]
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
In [9]:
df.shape
Out[9]:
(630, 1002)
In [10]:
df.columns = list(arange(1001)+1)+['Protocol2']
In [11]:
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [12]:
df.Protocol2.value_counts()
Out[12]:
HiSeq-2000 Smart-Seq2 rep1         96
BGISEQ-500 SMARTer                 96
HiSeq-2000 SMARTer                 95
BGISEQ-500 Smart-Seq2 rep1         95
BGISEQ-500 Smart-Seq2 rep2         72
HiSeq-2000 Smart-Seq2 rep2         60
Hiseq4000 K562 Sequencing rep1     37
BGISEQ-500 K562 Sequencing rep1    37
Hiseq4000 ESC Sequencing rep1      21
BGISEQ-500 ESC Sequencing rep1     21
Name: Protocol2, dtype: int64
In [13]:
colors = [matplotlib.colors.rgb2hex(plt.get_cmap("Paired")(i)) for i in range(0,10)]
In [14]:
cc = dict(zip(df.Protocol2.unique(),colors))
In [17]:
figsize(5,5)
for i in arange(df.shape[0]):
    plt.scatter(arange(1001)+1, df.iloc[i][:-1], s = 0.1, label = df.iloc[i][-1], c = cc[df.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[17]:
([<matplotlib.axis.XTick at 0x12a109c88>,
  <matplotlib.axis.XTick at 0x12a1095c0>,
  <matplotlib.axis.XTick at 0x12a109390>,
  <matplotlib.axis.XTick at 0x12a6d97b8>,
  <matplotlib.axis.XTick at 0x12a6d9c88>,
  <matplotlib.axis.XTick at 0x12a6e1198>],
 <a list of 6 Text xticklabel objects>)
In [ ]:
 
In [18]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep1')]
for i in arange(df1.shape[0]):
    plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[18]:
([<matplotlib.axis.XTick at 0x12a772630>,
  <matplotlib.axis.XTick at 0x12a70bdd8>,
  <matplotlib.axis.XTick at 0x12a70b8d0>,
  <matplotlib.axis.XTick at 0x12a934828>,
  <matplotlib.axis.XTick at 0x12a92b198>,
  <matplotlib.axis.XTick at 0x12a91f9b0>],
 <a list of 6 Text xticklabel objects>)
In [19]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep2')]
for i in arange(df1.shape[0]):
    plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[19]:
([<matplotlib.axis.XTick at 0x12a9615f8>,
  <matplotlib.axis.XTick at 0x12a964f60>,
  <matplotlib.axis.XTick at 0x12a964cf8>,
  <matplotlib.axis.XTick at 0x12ab13940>,
  <matplotlib.axis.XTick at 0x12ab13e10>,
  <matplotlib.axis.XTick at 0x12ab1a320>],
 <a list of 6 Text xticklabel objects>)
In [20]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('SMARTer')]
for i in arange(df1.shape[0]):
    plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[20]:
([<matplotlib.axis.XTick at 0x12ab3a320>,
  <matplotlib.axis.XTick at 0x12ab4cba8>,
  <matplotlib.axis.XTick at 0x12ab4c3c8>,
  <matplotlib.axis.XTick at 0x12ad65908>,
  <matplotlib.axis.XTick at 0x12ad65dd8>,
  <matplotlib.axis.XTick at 0x12ad6c2e8>],
 <a list of 6 Text xticklabel objects>)
In [21]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('ESC Sequencing rep1')]
for i in arange(df1.shape[0]):
    plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[21]:
([<matplotlib.axis.XTick at 0x12ad90710>,
  <matplotlib.axis.XTick at 0x12ad900f0>,
  <matplotlib.axis.XTick at 0x12ad94eb8>,
  <matplotlib.axis.XTick at 0x12aed33c8>,
  <matplotlib.axis.XTick at 0x12aeb2128>,
  <matplotlib.axis.XTick at 0x12aedb940>],
 <a list of 6 Text xticklabel objects>)
In [22]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('K562 Sequencing rep1')]
for i in arange(df1.shape[0]):
    plt.scatter(arange(1001)+1, df1.iloc[i][:-1], s = 0.1, label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[22]:
([<matplotlib.axis.XTick at 0x12af0b0b8>,
  <matplotlib.axis.XTick at 0x12af0ca20>,
  <matplotlib.axis.XTick at 0x12af0c6d8>,
  <matplotlib.axis.XTick at 0x12affd358>,
  <matplotlib.axis.XTick at 0x12affd828>,
  <matplotlib.axis.XTick at 0x12affdcf8>],
 <a list of 6 Text xticklabel objects>)

Coverage

In [23]:
df1 = pd.read_csv("../data/coverage_bgi.csv", index_col=0)
df2 = pd.read_csv("../data/coverage_sanger.csv", index_col=0)
phn = pd.read_csv("../data/phn.csv",index_col=0)
df = pd.concat([df1,df2], axis=0)

df.index = df.index.str.split('/').str[-1]
In [24]:
df.head()
Out[24]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
CL100029124_L01_12 0.023623 0.036296 0.047187 0.054767 0.054920 0.060736 0.059592 0.059710 0.059129 0.059074 0.057511 0.057206 0.057423 0.058895 0.058921 0.061339 0.061335 0.055574 0.016204 0.000557
CL100029124_L01_28 0.024580 0.037725 0.048140 0.055931 0.055831 0.061910 0.060384 0.059999 0.059907 0.059479 0.057508 0.056610 0.056575 0.058095 0.057722 0.060169 0.060578 0.053653 0.014312 0.000895
CL100029124_L01_38 0.026996 0.039082 0.047983 0.054802 0.054964 0.059027 0.057418 0.056112 0.057089 0.057300 0.055814 0.056178 0.055539 0.059096 0.056869 0.063819 0.065598 0.057757 0.016861 0.001696
CL100029124_L01_45 0.023437 0.036906 0.048471 0.056640 0.057062 0.062720 0.060768 0.060917 0.060976 0.060245 0.058028 0.057346 0.057036 0.057178 0.057443 0.058911 0.058989 0.052235 0.014138 0.000554
CL100029124_L02_32 0.022339 0.036751 0.049051 0.056504 0.057486 0.063056 0.062523 0.062156 0.061282 0.059827 0.058960 0.057597 0.056417 0.056797 0.057190 0.058179 0.058382 0.051513 0.013463 0.000527
In [25]:
df['Protocol2'] = phn.loc[df.index]['Protocol2']
df = df[~df.Protocol2.isna()]
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
In [26]:
df.shape
Out[26]:
(630, 21)
In [27]:
df.columns = list(np.linspace(0, 100, 20))+['Protocol2']
In [28]:
df['Protocol2'] = df['Protocol2'].str.replace('Seuqencing','Sequencing')
In [29]:
df.Protocol2.value_counts()
Out[29]:
HiSeq-2000 Smart-Seq2 rep1         96
BGISEQ-500 SMARTer                 96
HiSeq-2000 SMARTer                 95
BGISEQ-500 Smart-Seq2 rep1         95
BGISEQ-500 Smart-Seq2 rep2         72
HiSeq-2000 Smart-Seq2 rep2         60
Hiseq4000 K562 Sequencing rep1     37
BGISEQ-500 K562 Sequencing rep1    37
Hiseq4000 ESC Sequencing rep1      21
BGISEQ-500 ESC Sequencing rep1     21
Name: Protocol2, dtype: int64
In [30]:
colors = [matplotlib.colors.rgb2hex(plt.get_cmap("Paired")(i)) for i in range(0,10)]
In [31]:
cc = dict(zip(df.Protocol2.unique(),colors))
In [33]:
figsize(5,5)
for i in arange(df.shape[0]):
    plt.plot(np.linspace(0, 100, 20), df.iloc[i][:-1], label = df.iloc[i][-1], c = cc[df.iloc[i][-1]])
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
# plt.xticks([0,200,400,600,800,1000],[0,20,40,60,80,100])
Out[33]:
Text(0,0.5,'density')

[PR10]

In [34]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep1')]
for i in arange(df1.shape[0]):
    plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
             alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
Out[34]:
Text(0,0.5,'density')
In [35]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('Smart-Seq2 rep2')]
for i in arange(df1.shape[0]):
    plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
             alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
Out[35]:
Text(0,0.5,'density')
In [36]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('SMARTer')]
for i in arange(df1.shape[0]):
    plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
             alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
Out[36]:
Text(0,0.5,'density')
In [37]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('ESC Sequencing rep1')]
for i in arange(df1.shape[0]):
    plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
             alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
Out[37]:
Text(0,0.5,'density')
In [38]:
figsize(5,5)
df1 = df[df.Protocol2.str.endswith('K562 Sequencing rep1')]
for i in arange(df1.shape[0]):
    plt.plot(np.linspace(0, 100, 20), df1.iloc[i][:-1], label = df1.iloc[i][-1], c = cc[df1.iloc[i][-1]],\
             alpha=0.2)
plt.xlabel('length %', fontsize=18)
plt.ylabel('density', fontsize=18)
Out[38]:
Text(0,0.5,'density')
In [ ]:
 

Alternative Splicing Event

In [139]:
phn = pd.read_csv("../data/phn.csv",index_col=0)

df = pd.read_csv("../data/ASE_stat.csv", index_col=0, sep='\t')

df.columns = ['ASE']

df = df.loc[df.index.isin(phn.index)]

phn['ASE'] = 0

ds = phn['ASE']

ds.loc[df.index] = df['ASE'].tolist()

phn['ASE'] = ds
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py:194: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [140]:
phn['Protocol2'] = phn['Protocol2'].str.replace("Seuqencing",'Sequencing')
In [141]:
phn.Protocol2.value_counts()
Out[141]:
BGISEQ-500 K562 Sequencing rep1           135
BGISEQ-500 ESC Sequencing rep1            122
BGISEQ-500 K562 Library rep               108
BGISEQ-500 SMARTer                         96
HiSeq-2000 SMARTer                         96
HiSeq-2000 Smart-Seq2 rep1                 96
BGISEQ-500 Smart-Seq2 rep1                 95
BGISEQ-500 ESC Library rep                 95
BGISEQ-500 K562 Sequencing rep2            87
BGISEQ-500 Smart-Seq2 rep2                 72
HiSeq-2000 Smart-Seq2 rep2                 61
BGISEQ-500 ESC Sequencing rep2             51
Hiseq4000 K562 Sequencing rep1             37
Hiseq4000 ESC Sequencing rep1              30
Hiseq4000 K562 Library rep                 28
Hiseq4000 ESC Library rep                  23
BGISEQ-500 Missed, ESC Sequencing rep1      3
Hiseq4000 Missed, ESC Sequencing rep1       3
Name: Protocol2, dtype: int64
In [142]:
df = phn[(phn['Protocol2'].str.startswith('BGISEQ-500 K562')|phn['Protocol2'].str.startswith('Hiseq4000 K562'))]
In [143]:
df = df[df['Protocol2'].str.endswith('rep1')]
In [144]:
df.Protocol2.value_counts()
Out[144]:
BGISEQ-500 K562 Sequencing rep1    135
Hiseq4000 K562 Sequencing rep1      37
Name: Protocol2, dtype: int64
In [ ]:
 
In [145]:
ax=sns.violinplot(x='Protocol2',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Protocol2',y='ASE',data=df, jitter=0.3)
In [147]:
df['Read_type'].value_counts()
Out[147]:
SE50     45
SE100    45
PE100    45
PE101    37
Name: Read_type, dtype: int64
In [148]:
ax=sns.violinplot(x='Read_type',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Read_type',y='ASE',data=df, jitter=0.3)

e6

In [13]:
phn = pd.read_csv("../data/phn.csv",index_col=0)

df = pd.read_table("../data/AS_stat.xls", index_col=0, sep='\t')

df.columns = ['ASE']
In [14]:
df = df.loc[df.index.isin(phn.index)]

phn['ASE'] = 0

ds = phn['ASE']

ds.loc[df.index] = df['ASE'].tolist()

phn['ASE'] = ds
In [15]:
ds.sum()
Out[15]:
983
In [ ]:
 
In [3]:
ls ../data/AS*
../data/ASE_stat.csv    ../data/AS_stat.xls     ../data/AS_stat_e6.csv
In [ ]:
 
In [16]:
phn['Protocol2'] = phn['Protocol2'].str.replace("Seuqencing",'Sequencing')
In [17]:
phn.Protocol2.value_counts()
Out[17]:
BGISEQ-500 K562 Sequencing rep1           135
BGISEQ-500 ESC Sequencing rep1            122
BGISEQ-500 K562 Library rep               108
HiSeq-2000 SMARTer                         96
BGISEQ-500 SMARTer                         96
HiSeq-2000 Smart-Seq2 rep1                 96
BGISEQ-500 Smart-Seq2 rep1                 95
BGISEQ-500 ESC Library rep                 95
BGISEQ-500 K562 Sequencing rep2            87
BGISEQ-500 Smart-Seq2 rep2                 72
HiSeq-2000 Smart-Seq2 rep2                 61
BGISEQ-500 ESC Sequencing rep2             51
Hiseq4000 K562 Sequencing rep1             37
Hiseq4000 ESC Sequencing rep1              30
Hiseq4000 K562 Library rep                 28
Hiseq4000 ESC Library rep                  23
Hiseq4000 Missed, ESC Sequencing rep1       3
BGISEQ-500 Missed, ESC Sequencing rep1      3
Name: Protocol2, dtype: int64
In [9]:
# df = phn[(phn['Protocol2'].str.startswith('BGISEQ-500 K562')|phn['Protocol2'].str.startswith('Hiseq4000 K562'))]
In [20]:
df = phn
In [21]:
df = df[df['Protocol2'].str.endswith('rep1')]
In [11]:
df.Protocol2.value_counts()
Out[11]:
BGISEQ-500 K562 Sequencing rep1    135
BGISEQ-500 K562 Library rep        108
BGISEQ-500 K562 Sequencing rep2     87
Hiseq4000 K562 Sequencing rep1      37
Hiseq4000 K562 Library rep          28
Name: Protocol2, dtype: int64
In [22]:
ax=sns.violinplot(x='Read_type',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Read_type',y='ASE',data=df, jitter=0.3)
In [ ]:
 
In [36]:
df = phn[(phn['Protocol2'].str.startswith('BGISEQ-500 ESC')|phn['Protocol2'].str.startswith('Hiseq4000 ESC'))]
In [37]:
df = df[df['Protocol2'].str.endswith('rep1')]
In [38]:
df.Protocol2.value_counts()
Out[38]:
BGISEQ-500 ESC Sequencing rep1    122
Hiseq4000 ESC Sequencing rep1      30
Name: Protocol2, dtype: int64
In [39]:
ax=sns.violinplot(x='Protocol2',y='ASE',data=df,inner=None, color=".9",cut=0)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax = sns.stripplot(x='Protocol2',y='ASE',data=df, jitter=0.3)
In [22]:
phn = pd.read_csv("../pdata/bgi_human_phn.csv", index_col=0)
In [33]:
phn = pd.read_csv("../pdata/bgi_mouse_phn.csv", index_col=0)
In [34]:
phn.head()
Out[34]:
Batch Cell Cell_line Fullname Place Platform Protocol Protocol2 Read_type Well repeat
CL100029125_L01_12 Smart-Seq2 rep2 38 ESC Smart-Seq2 rep2_38 China BGISEQ-500 Smart-Seq2 BGISEQ-500 Smart-Seq2 rep2 PE100 D2 Smart-Seq2 rep2
CL100029125_L01_19 Smart-Seq2 rep2 27 ESC Smart-Seq2 rep2_27 China BGISEQ-500 Smart-Seq2 BGISEQ-500 Smart-Seq2 rep2 PE100 C3 Smart-Seq2 rep2
CL100029125_L01_20 Smart-Seq2 rep2 39 ESC Smart-Seq2 rep2_39 China BGISEQ-500 Smart-Seq2 BGISEQ-500 Smart-Seq2 rep2 PE100 D3 Smart-Seq2 rep2
CL100029125_L01_21 Smart-Seq2 rep2 51 ESC Smart-Seq2 rep2_51 China BGISEQ-500 Smart-Seq2 BGISEQ-500 Smart-Seq2 rep2 PE100 E3 Smart-Seq2 rep2
CL100029125_L02_23 Smart-Seq2 rep2 81 ESC Smart-Seq2 rep2_81 China BGISEQ-500 Smart-Seq2 BGISEQ-500 Smart-Seq2 rep2 PE100 G9 Smart-Seq2 rep2
In [35]:
phn.Protocol2.value_counts()
Out[35]:
BGISEQ-500 ESC Sequencing rep1            122
BGISEQ-500 SMARTer                         96
BGISEQ-500 Smart-Seq2 rep1                 95
BGISEQ-500 ESC Library rep                 95
BGISEQ-500 Smart-Seq2 rep2                 72
BGISEQ-500 ESC Sequencing rep2             51
Hiseq4000 ESC Sequencing rep1              30
Hiseq4000 ESC Library rep                  23
Hiseq4000 Missed, ESC Sequencing rep1       3
BGISEQ-500 Missed, ESC Sequencing rep1      3
Name: Protocol2, dtype: int64
In [36]:
phn = phn[phn.Protocol2.str.startswith('BGISEQ-500 K562')|phn.Protocol2.str.startswith('BGISEQ-500 ESC')|\
          phn.Protocol2.str.startswith('Hiseq4000 K562')|phn.Protocol2.str.startswith('Hiseq4000 ESC')]
In [37]:
phn.shape
Out[37]:
(321, 11)
In [38]:
phn.Protocol2.value_counts()
Out[38]:
BGISEQ-500 ESC Sequencing rep1    122
BGISEQ-500 ESC Library rep         95
BGISEQ-500 ESC Sequencing rep2     51
Hiseq4000 ESC Sequencing rep1      30
Hiseq4000 ESC Library rep          23
Name: Protocol2, dtype: int64
In [20]:
ls ../pdata
bgi.csv            bgi_ds_e6.csv      bgi_se.csv         sanger_ds_e5.csv
bgi_ds_e4.csv      bgi_human_phn.csv  sanger.csv         sanger_ds_e6.csv
bgi_ds_e5.csv      bgi_mouse_phn.csv  sanger_ds_e4.csv
In [21]:
ls ../pdata1
Fig1BC.csv        bgi.csv           bgi_se.csv        sanger_ds_e5.csv
Fig2B.csv         bgi_ds_e4.csv     pseudobulk.csv    sanger_ds_e6.csv
FigS2C.csv        bgi_ds_e5.csv     sanger.csv
Heatmap.csv       bgi_ds_e6.csv     sanger_ds_e4.csv
In [39]:
df = pd.read_csv("../pdata/bgi.csv", index_col=0)
In [40]:
df.head()
Out[40]:
num_processed num_mapped percent_mapped global_fl_mode robust_fl_mode detection_limit accuracy detection_limit_ERCC accuracy_ERCC detection_limit_SIRV ... ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1 Total_counts study
CL100029124_L01_12 5582801 4671516 83.676921 148 148 45.389061 0.396999 92.975422 0.847053 3.761187 ... 0.0 0.0 0.0 1985.650 0.0000 0.0 165.7770 0.0 4.290458e+06 bgi
CL100029124_L01_28 11486345 9477188 82.508300 159 159 48.869730 0.593942 56.439247 0.790190 16.762763 ... 0.0 0.0 0.0 5652.560 0.0000 0.0 540.3240 0.0 8.173253e+06 bgi
CL100029124_L01_38 2031511 1612973 79.397700 159 159 42.721952 0.657302 81.399368 0.732718 5.813530 ... 0.0 0.0 0.0 831.124 8.0000 0.0 74.0746 0.0 1.354927e+06 bgi
CL100029124_L01_45 8935193 7810273 87.410233 148 148 36.034480 0.719165 100.744771 0.775227 3.142785 ... 0.0 0.0 0.0 3414.210 52.0431 0.0 400.8270 0.0 6.660325e+06 bgi
CL100029124_L02_32 8412736 6996281 83.162969 169 169 24.293444 0.659067 56.859643 0.768557 2.197963 ... 0.0 0.0 0.0 3810.470 41.0000 0.0 283.6570 0.0 5.608494e+06 bgi

5 rows × 34855 columns

In [41]:
df.shape
Out[41]:
(437, 34855)
In [ ]:
 
In [42]:
df = df.loc[phn.index]
 /Users/zmiao/bin/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
In [43]:
df.shape
Out[43]:
(321, 34855)
In [44]:
df = df[~df.num_processed.isna()]
In [45]:
df
Out[45]:
num_processed num_mapped percent_mapped global_fl_mode robust_fl_mode detection_limit accuracy detection_limit_ERCC accuracy_ERCC detection_limit_SIRV ... ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1 Total_counts study
CL100037423_L01_11 15170946.0 10120611.0 66.710481 106.0 106.0 35151.149740 0.702067 36.596185 0.702067 inf ... 0.0 0.0 0.0 8435.15 652.49600 23.50370 568.349 0.0 9.945582e+06 bgi
CL100037423_L01_12 5623253.0 3807504.0 67.709989 117.0 117.0 49061.609878 0.857491 43.900762 0.857491 inf ... 0.0 0.0 1.0 1215.16 1.11947 3.18605 309.916 0.0 3.744477e+06 bgi
CL100037423_L01_14 9043635.0 5267437.0 58.244688 107.0 107.0 112336.182660 0.826834 75.865215 0.826834 inf ... 0.0 0.0 0.0 4188.08 0.00000 2.16117 343.300 0.0 5.126754e+06 bgi
CL100037423_L01_15 9981237.0 6310414.0 63.222765 107.0 107.0 22736.597017 0.712770 34.171003 0.712770 inf ... 0.0 0.0 0.0 4055.48 436.22400 13.99510 477.583 0.0 6.165945e+06 bgi
CL100037423_L01_16 12090756.0 7789963.0 64.429081 96.0 106.0 41574.620494 0.731896 48.994879 0.731896 inf ... 0.0 0.0 0.0 5494.20 959.25300 8.09529 241.904 0.0 7.618005e+06 bgi
CL100037423_L01_19 7409737.0 4476024.0 60.407326 96.0 107.0 28422.832282 0.762663 24.206731 0.762663 inf ... 0.0 0.0 0.0 1891.18 0.00000 6.41275 217.856 0.0 4.362610e+06 bgi
CL100037423_L01_23 13400447.0 7376375.0 55.045738 107.0 107.0 28766.284848 0.690047 37.889789 0.748319 inf ... 0.0 0.0 0.0 8218.81 468.27800 15.75370 522.813 0.0 7.174020e+06 bgi
CL100037423_L01_27 16339745.0 10366960.0 63.446278 96.0 107.0 71454.902480 0.739208 94.457761 0.739208 inf ... 0.0 0.0 0.0 10174.90 0.00000 10.35090 747.344 8.0 1.019003e+07 bgi
CL100037423_L01_28 7555578.0 5456576.0 72.219174 106.0 106.0 94078.007582 0.854372 55.293933 0.854372 inf ... 0.0 0.0 0.0 3211.68 6.29129 8.77445 267.771 0.0 5.343857e+06 bgi
CL100037423_L01_30 11259686.0 7082708.0 62.903246 86.0 107.0 145639.944006 0.829451 50.586273 0.829451 inf ... 0.0 0.0 0.0 3671.85 1.00000 1.06459 421.244 0.0 6.896884e+06 bgi
CL100037423_L01_31 8913573.0 5393433.0 60.508093 96.0 106.0 17547.186142 0.840669 32.654014 0.840669 inf ... 0.0 0.0 0.0 2052.06 3.00000 7.40348 380.341 0.0 5.248936e+06 bgi
CL100037423_L01_35 8185852.0 6251832.0 76.373626 108.0 108.0 68685.779219 0.962548 232.585563 0.962548 inf ... 0.0 0.0 0.0 5506.16 185.41600 4.43009 479.568 0.0 6.165338e+06 bgi
CL100037423_L01_36 12438636.0 7266102.0 58.415585 86.0 106.0 58458.427077 0.780429 45.013612 0.780429 inf ... 0.0 0.0 0.0 7185.26 43.09080 21.82270 507.045 0.0 7.049260e+06 bgi
CL100037423_L01_38 11875637.0 8156958.0 68.686488 107.0 107.0 25207.618268 0.711175 26.998193 0.711175 inf ... 0.0 0.0 1.0 4639.23 566.75200 8.33313 418.970 0.0 8.022718e+06 bgi
CL100037423_L01_39 9285548.0 6327999.0 68.148902 117.0 117.0 22737.342157 0.614524 28.696016 0.720103 inf ... 0.0 0.0 0.0 1233.65 0.00000 13.51610 291.717 0.0 6.170952e+06 bgi
CL100037423_L01_4 16948721.0 11742516.0 69.282608 107.0 107.0 45091.392530 0.815854 43.358997 0.815854 inf ... 0.0 0.0 1.0 8580.78 5.00000 237.29800 788.697 0.0 1.152195e+07 bgi
CL100037423_L01_40 7243882.0 4076979.0 56.281687 96.0 107.0 29786.680568 0.819399 69.116965 0.819399 inf ... 0.0 0.0 0.0 2835.28 11.00000 2.13597 148.057 0.0 3.967971e+06 bgi
CL100037423_L01_44 9910651.0 6194883.0 62.507327 107.0 107.0 23067.146322 0.749790 44.209051 0.749790 inf ... 0.0 0.0 0.0 3572.76 273.32300 6.04309 269.370 0.0 6.061048e+06 bgi
CL100037423_L01_47 3160603.0 1742145.0 55.120653 97.0 107.0 30459.424830 0.770349 53.081102 0.770349 inf ... 0.0 0.0 0.0 1903.89 74.72210 2.14757 149.922 0.0 1.682145e+06 bgi
CL100037423_L01_48 6509626.0 4699981.0 72.200477 117.0 117.0 23067.146325 0.808030 44.209051 0.808030 inf ... 0.0 0.0 0.0 2857.55 0.00000 12.55660 197.999 0.0 4.634569e+06 bgi
CL100037423_L01_7 10838428.0 7097693.0 65.486369 96.0 106.0 91709.090817 0.729670 31.502463 0.729670 inf ... 0.0 0.0 0.0 2987.52 40.11030 26.47130 516.701 0.0 6.966301e+06 bgi
Index_CWHPEI17060014 5693600.0 3724831.0 65.421368 128.0 128.0 94171.675480 0.841117 41.670378 0.841117 inf ... 0.0 0.0 0.0 2937.60 123.95400 1.03644 388.279 0.0 3.674754e+06 bgi
Index_CWHPEI17060015 12641770.0 6671338.0 52.772183 118.0 118.0 30511.583673 0.845575 14.877587 0.845575 inf ... 0.0 0.0 0.0 5031.43 6.23730 4.14732 438.761 1.0 6.550096e+06 bgi
Index_CWHPEI17060016 5448381.0 2990950.0 54.896124 117.0 117.0 24637.355231 0.774592 18.576259 0.774592 inf ... 0.0 0.0 0.0 2968.86 0.00000 1.03876 258.014 0.0 2.908718e+06 bgi
Index_CWHPEI17060017 5252938.0 3161191.0 60.179484 86.0 106.0 23792.219038 0.716136 20.906717 0.716136 inf ... 0.0 0.0 0.0 1931.98 195.74800 5.00988 260.290 0.0 3.111967e+06 bgi
Index_CWHPEI17060019 5375103.0 3082938.0 57.355887 149.0 149.0 58513.901354 0.866993 23.068747 0.866993 inf ... 0.0 0.0 0.0 1396.21 139.46800 4.13642 200.664 0.0 3.022632e+06 bgi
Index_CWHPEI17060020 5925595.0 2844366.0 48.001357 106.0 106.0 24348.185408 0.730708 34.561792 0.730708 inf ... 0.0 0.0 0.0 3351.05 25.30420 4.15135 258.024 0.0 2.762961e+06 bgi
Index_CWHPEI17060021 6336101.0 3820792.0 60.301943 107.0 107.0 12246.669549 0.666267 21.338764 0.666267 inf ... 0.0 0.0 1.0 5169.01 4.71148 8.36113 348.660 0.0 3.739427e+06 bgi
Index_CWHPEI17060023 6697811.0 3714522.0 55.458746 86.0 107.0 29609.624679 0.791529 16.764603 0.791529 inf ... 0.0 0.0 1.0 2687.80 2.25101 3.14152 584.033 0.0 3.648098e+06 bgi
Index_CWHPEI17060024 6396930.0 3938384.0 61.566783 75.0 106.0 23492.477212 0.787831 26.725733 0.787831 inf ... 0.0 0.0 0.0 3207.75 0.00000 73.68960 335.345 0.0 3.868441e+06 bgi
Index_CWHPEI17060032 5548845.0 3228323.0 58.180090 139.0 139.0 48139.946539 0.823260 33.495560 0.823260 inf ... 0.0 0.0 0.0 2362.56 221.12400 2.08399 263.058 0.0 3.176142e+06 bgi
Index_CWHPEI17060033 5126152.0 2947844.0 57.505981 118.0 118.0 89574.911880 0.745012 27.432514 0.745012 inf ... 0.0 0.0 0.0 2519.84 2.00000 0.00000 297.052 0.0 2.876949e+06 bgi
Index_CWHPEI17060035 4777708.0 2282941.0 47.783184 96.0 106.0 30204.933693 0.794632 60.571953 0.794632 inf ... 0.0 0.0 0.0 2141.63 3.30180 0.00000 261.619 0.0 2.223008e+06 bgi
Index_CWHPEI17060037 4329018.0 1858968.0 42.942025 106.0 106.0 26115.260184 0.812990 34.957637 0.812990 inf ... 0.0 0.0 0.0 3465.80 97.29630 0.00000 223.221 0.0 1.782769e+06 bgi
Index_CWHPEI17060038 5533371.0 3072786.0 55.531899 170.0 170.0 23600.886801 0.721231 16.373053 0.721231 inf ... 0.0 0.0 0.0 1401.32 1.01513 0.00000 164.844 0.0 2.997108e+06 bgi
Index_CWHPEI17060039 4603530.0 2264838.0 49.197855 107.0 107.0 13853.195103 0.837851 15.333973 0.837851 inf ... 0.0 0.0 1.0 1415.70 3.00000 1.04048 234.364 0.0 2.209109e+06 bgi
Index_CWHPEI17060040 5423977.0 2648210.0 48.824138 117.0 117.0 16980.082745 0.779361 19.978690 0.779361 inf ... 0.0 0.0 0.0 3592.59 178.81200 1.02564 235.501 0.0 2.573712e+06 bgi
Index_CWHPEI17060041 4699359.0 2558661.0 54.447021 127.0 127.0 18384.822323 0.749897 29.105157 0.749897 inf ... 0.0 0.0 0.0 2141.57 193.02400 1.04247 281.707 0.0 2.498282e+06 bgi
Index_CWHPEI17060042 6020061.0 3329618.0 55.308709 106.0 106.0 53178.762653 0.801280 38.786856 0.801280 inf ... 0.0 0.0 0.0 3169.89 10.28990 4.19722 405.632 0.0 3.262866e+06 bgi
Index_CWHPEI17060043 6837560.0 4348251.0 63.593606 128.0 128.0 36795.358309 0.861710 28.404785 0.861710 inf ... 0.0 0.0 0.0 3281.23 1.00704 3.11551 258.469 0.0 4.281518e+06 bgi
Index_CWHPEI17060044 5735205.0 2539824.0 44.284799 44.0 118.0 32500.395995 0.860466 61.446239 0.860466 inf ... 0.0 0.0 1.0 2671.73 11.00000 0.00000 182.664 0.0 2.473398e+06 bgi
Index_CWHPEI17060047 4842497.0 2566052.0 52.990265 44.0 106.0 55031.848030 0.829801 26.131753 0.829801 inf ... 0.0 0.0 0.0 2311.92 309.15500 4.21960 139.409 0.0 2.508490e+06 bgi

42 rows × 34855 columns

In [46]:
df1 = df.replace([np.inf, -np.inf], np.nan).dropna(subset=["detection_limit_SIRV"], how="all")
In [47]:
df1
Out[47]:
num_processed num_mapped percent_mapped global_fl_mode robust_fl_mode detection_limit accuracy detection_limit_ERCC accuracy_ERCC detection_limit_SIRV ... ENSMUSG00000109159.1 ENSMUSG00000107243.1 ENSMUSG00000110425.1 ENSMUSG00000066315.9 ENSMUSG00000073771.11 ENSMUSG00000107035.1 ENSMUSG00000080242.5 ENSMUSG00000081408.1 Total_counts study

0 rows × 34855 columns

In [ ]:
 
In [2]:
ls ../supinfo/
TableS2_sanger.csv        TableS5_bgi_ds_e6.csv     bgi_mouse_phn.csv
TableS3_sanger_ds_e6.csv  TableS6_bgi_se.csv        phn.csv
TableS4_bgi.csv           bgi_human_phn.csv
In [3]:
from glob import iglob
In [7]:
for f in iglob('../supinfo/Tab*.csv'):
    df = pd.read_csv(f,index_col=0)
    fo = f.replace(".csv",'.1.csv')
    df = df[df.columns[~df.columns.str.startswith('ENS')]]
    df.to_csv(fo)
In [ ]:
 
In [ ]:
 

End

In [ ]: